Inference robust to outliers with l1-norm penalization

06/04/2019 ∙ by Jad Beyhum, et al. ∙ 0

This paper considers the problem of inference in a linear regression model with outliers where the number of outliers can grow with sample size but their proportion goes to 0. We apply the square-root lasso estimator penalizing the l1-norm of a random vector which is non-zero for outliers. We derive rates of convergence and asymptotic normality. Our estimator has the same asymptotic variance as the OLS estimator in the standard linear model. This enables to build tests and confidence sets in the usual and simple manner. The proposed procedure is also computationally advantageous as it amounts to solving a convex optimization program. Overall, the suggested approach constitutes a practical robust alternative to the ordinary least squares estimator.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

This paper considers a linear regression model with outliers. The statistican observes a dataset of

i.i.d. realizations of an outcome scalar random variables

and a random vector of covariates with support in , such that is positive definite. We place ourselves in the Huber’s contamination framework, that is the distribution of

is a mixture between two distributions. With probability

, it corresponds to a linear regression model with conditionally homoscedastic errors, that is there exists and scalar i.i.d. random variables such that , and . With probability , the distribution is unspecified. An observation is called an outlier when it was generated according to this unspecified distribution . The goal of the statistician is to estimate the parameter . This model can be rewritten as


where is scalar random variable which is equal to when an observation is not an outlier and which dependence with and is left unrestricted. The probability that is different from is hence, . We derive estimation results in an asymptotic where goes to as a function of the sample size .

This model can represent various situations of practical interest. First, the statistician could be interested in because it corresponds to the slope of the best linear predictor of given for the observations for which . These coefficients are of interest because, in the presence of outliers, the slope of the best linear predictor of given for the whole population may differ greatly from and hence a statistical analysis based on the whole population may lead to a poor prediction accuracy for the large part of the population that are not outliers.

Second, if is given a causal interpretation, then it represents the causal effect of the regressors for the population of "standard" individuals. That is, for instance, if the aim is evaluate a program, it could be that the treatment effect is negative for most of the population but strongly positive for a small fraction of the individuals, the outliers. The policy maker may not be willing to implement a policy that has a negative effect on most of the population, giving interest to a statistical procedure that estimates the treatment effect of the majority of the population robustly.

Finally, could represent the true coefficient of the best linear predictor of given in a measurement errors model. Indeed, assume that our population follows the model with but that we do not observe but , this fits in our framework with and

Hence, allows for both measurement errors in - called outliers in the -direction - and in , the outliers in the -direction, for a small fraction of the population, see Rousseeuw and Leroy (2005) for a precise discussion.

This paper develop results on the estimation of when the vector is sparse in the sense that goes to with . We rely on a variant of the square-root lasso estimator of Belloni et al. (2011a) which penalizes the -norm of the vector . The advantages of our estimator are that the penalty parameter does not depend on the variance of the error term and is computationally tractable. If the vector is sparse enough, we show that our estimator is -consistent and asymptotically normal. It has the same asymptotic variance as the OLS estimator in the standard linear model without outliers.

Related literature. This paper is connected to at least two different research fields. First, it draws on the literature on inference in the high-dimensional linear regression model and closely related variants of this model. A series of papers from Belloni et al. (2011b, 2012, 2014a, 2014b, 2016, 2017)

study a variety of models ranging from panel data models to quantile regression in an high-dimensional setting.

Gautier et al. (2011) proposes inference procedures in an high-dimensional IV model with a large number of both regressors and instrumental variables. Javanmard and Montanari (2014); Van de Geer et al. (2014); Zhang and Zhang (2014)

suggest debiasing strategies of the lasso estimator to obtain confidence intervals in a high-dimensional linear regression model. We borrow from this literature by using an

-penalized estimator and complete existing research by deriving inference results for the linear regression model with outliers.

Next, our work is related to the literature on robust regression. For detailed accounts of this field, see Rousseeuw and Leroy (2005); Hampel et al. (2011); Maronna et al. (2018). The literature identifies a trade-off between efficiency and robustness, as explicited below. Indeed, -estimators (such as the Ordinary Least-Squares (OLS) estimator) are efficient when data is generated by the standard linear model without outliers and Gaussian errors but this comes at the price of a breakdown point - the maximum proportion of the data that can be contaminated without the estimator performing arbitrarily poorly - of . By contrast, -estimators such as the Least Median of Squares (LMS) and the Least Trimmed Squares (LTS) have a strictly positive and fixed breakdown point. They are also asymptotically normal in the model without outliers but are not efficient and have computational issues because of the non-convexity of their objective functions (see Rousseeuw and Leroy (2005)). Our estimator is efficient under certain conditions, because it attains the same asymptotic variance as the OLS estimator in the standard linear model. Unlike this literature, our procedure relies on a convex program and is computationally tractable, see Belloni et al. (2011a) for a detailed analysis. The proposed approach therefore provides a simple efficient alternative to the rest of the literature.

Within the robust regression literature some authors have considered the application of -norm penalization to robust estimation. In particular, our model nests the Huber’s contamination model for location estimation introduced in Huber et al. (1964). Indeed, if there is a single constant regressor, our model nests the following framework:

where i.i.d., is the mean of for non-outlying coefficients while is left unrestricted. Chen et al. (2018) show that the minimax lower bound for the squared -norm estimation error is of order greater than under gaussian errors, where is the number of outliers in the sample. When , we attain this lower bound up to a factor . Several strategies have been proposed to tackle this location estimation problem. The one which is the closest to ours is soft-thresholding using a lasso estimator, that is use

see for instance Collier and Dalalyan (2017). We substitute this estimator with a square-root lasso that has the advantage to provide guidance on the choice of the penalty level that is independent from the variance of the noise (see Belloni et al. (2011a)). We extend the analysis of this type of estimators to the linear regression model and add inference results to the literature. Other -norm penalized estimators for robust linear regression have been studied in the literature such as in Lambert-Lacroix et al. (2011); Dalalyan (2012); Li (2012); Alfons et al. (2013), but the authors do not provide inference results. Fan et al. (2017) considers robust estimation in the case where

is a high-dimensional parameter. Its estimator penalizes the Huber loss function by a term proportional to the

-norm of .

Notations. We use the following notations. For a matrix , is its transpose, is its -norm, is the -norm, is its sup-norm, is its operator norm and is the number of non-zero coefficients in , that is its -norm. For a probabilistic event , the fact that it happens w.p.a. (with probability approaching ) signifies that . Then, for , is the vector and is the matrix . is the projector on the vector space spanned by the columns of the matrix and , where

is the identity matrix of size

. We introduce and .

2 Low-dimensional linear regression

2.1 Framework

The probabilistic framework consists of a sequence of data generating processes (henceforth, DGPs) that depend on the sample size

. The joint distribution of

is independent from the sample size. We consider an asymptotic where goes to and where , the contamination level, depends on while the number of regressors remains fixed.

Our estimation strategy is able to handle models where is sparse, that is or, in other words, . Potentially, every individual’s can be generated by a distribution that does not follow a linear model but the difference between the distribution of and the one yielded by a linear model can only be important for a negligible proportion of individuals. Our subsequent theorems will help to quantify these previous statements.

2.2 Estimation procedure

We consider an estimation procedure that estimates both the coefficients and the effects of the regressors by a square-root lasso that penalizes only the coefficients , that is

where is a penalty level whose choice is discussed later. The advantage of the square-root lasso over the lasso estimator is that the penalty level does not depend on an estimate of the variance of . Hence, our procedure is simple in that it does not make use of any tuning parameter unlike the least median of quares and least trimmed squares estimators. An important remark is that if is such that , then

for any . Therefore, if is positive definite, is the OLS estimator of the regression of on , that is


Then, notice also that for all and , we have

Hence, because is feasible, it holds that


Under assumptions developed below, this procedure yields consistent estimation and asymptotic normality for . Remark that model (1) can be seen as a standard linear model with the coefficient

corresponding to the slope parameter of a dummy variable which value is

for the individual and otherwise. Hence, our analysis of the square-root lasso fits in the framework of Belloni et al. (2011a). However, our approach is met with additional technical difficulties because we penalize only a subset of the variables and there is no hope to estimate consistently as each of its entries is indirectly observed only once. As a result, we develop new assumptions and theorems that are better suited for the purposes of this paper.

2.3 Assumptions and results

The main assumption concerns the choice of the penalty level:

Assumption 2.1

We have

The tuning of prescribed by this penalty level depends on the distributional assumptions made on , in particular on the tails. The next lemma provides guidance on how to choose the regularization parameter according to assumptions on :

Lemma 2.1

It holds that . Additionally, if is such that and , then for any , satisfies Assumption 2.1.

The proof is given in Appendix A. This lemma suppresses the role of the matrix in the choice of the penalty and simplifies the decision procedure. It leads to the subsequent corollary:

Corollary 2.1

The following hold:

  • If are gaussian random variables, then satisfies Assumption 2.1 for any ;

  • If are sub-gaussian random variables, then there exists a constant such that satisfies Assumption 2.1;

  • If are sub-exponential random variables, then then there exists a constant such that satisfies Assumption 2.1.

The proof is given in Appendix A. The statistician can use Corollary 2.1 to decide on the penalization parameter given how heavy she expects the tails of the error term to be in her data. In practice, it is advised to choose the smallest penalty verifying Assumption 2.1. This can be done by Monte-Carlo simulations. Notice that our approach allows for heavy-tailed distributions such as sub-exponential random variables.

To derive the convergence rate of our estimator, we first bound the estimation error on and obtain the following result:

Lemma 2.2

Under Assumption 2.1 and if (and , it holds that

The proof is given in Appendix B. The rate of convergence of therefore is lower than if the errors are gaussian or sub-gaussian and we choose the penalty level as in Lemma 2.1. Note that, as standard in works related to the lasso estimator (see Bühlmann and Van De Geer (2011)), in our proof we make use of a compatibility condition that states that a compatibility constant is bounded from below with probability approaching one. The condition that is enough to show that this property holds as shown in Lemma B.1 in Appendix B. It is possible to find other sufficient conditions but it is outside the scope of this paper. Remark that if are i.i.d. sub-Gaussian random variables then allowing for the sparsity level .

Here, we show how to derive the rate of convergence of from Lemma 2.2. Assume that . Substituting by in (2), we obtain


Now, notice that

. By the law of large numbers,

, which implies that


By Lemma 2.2, this implies that

Finally, by the central limit theorem and Slutsky’s lemma, we have that

. This leads to Theorem 2.2.

Theorem 2.2

Under Assumption 2.1 and if , it holds that

This result allows to derive the rates of convergence under different assumptions on the tails of the distributions of the regressors and the error term. For instance, if and are i.i.d. sub-Gaussian random variables, then is consistent as long as for the choice of proposed in Lemma 2.1. In this case, this implies that our estimator reaches (up to a logarithmic factor) the minimax lower bound for the Huber’s contamination location model under gaussian errors, which is in -norm according to Chen et al. (2018). We attain the rate . Remark also that equation (5) explains the role of in the convergence rate of . For an individual , if is large then an error in the estimation of can contribute to an error in the estimation of via the term in (4). measures the maximum influence that an observation can have.

To show that our estimator is asymptotically normal, it suffices to assume that the bias term in (4) vanishes asymptotically:

Theorem 2.3

Under Assumption 2.1, assuming that (and ), we have

Moreover, and are consistent estimators of, respectively, and .

The proof that is given in Appendix C.When the entries of and are sub-Gaussian, for the choice of the penalty prescribed in Lemma 2.1, the contamination level needs to satisfy to be able to use 2.3 to prove asymptotic normality. Notice that the asymptotic variance of our estimator corresponds to the one of the OLS estimator in the standard linear model under homoscedasticity. Hence, confidence sets and tests can be built in the same manner as in the theory of the OLS estimator.

An important last remark concerns the meaning of confidence intervals developed using Theorem 2.3. Note that they are obtained under an asymptotic with triangular array data under which the number of outliers is allowed to go to infinity while the proportion of outliers goes to . The interpretation of a 95% confidence interval built with Theorem 2.3 is as follows: if the number of outliers in our data is low enough and the sample size is large enough, then there is a probability of approximatively that belongs to .

3 Computation and simulations

3.1 Iterative algorithm

We propose the following algorithm to compute our estimator. Because , as long as , we have that


This is a convex objective and we propose to iteratively minimize over , , and . Let us start from and compute the following sequence for until convergence:

  1. .

The following lemma explains how to perform step 2:

Lemma 3.1

For , if then . If then .

The proof is given in Appendix D.

3.2 Simulations

We apply this computation approach in a small simulation exercise. The data generating process is as follows: there are two regressors and , with for all and are i.i.d. random variables. are i.i.d. random variables. Then, we set

where is such that . In table 1, we present the bias, the variance, the mean squared error (MSE) and the coverage of confidence intervals for our estimator computed using the algorithm of Section 3.1, where we use iterations and with . This choice corresponds to the one outlined in Corollary 2.1. The bias, the variance and the coverage of confidence intervals for the naive OLS estimator:

are also reported. For the OLS estimator, the confidence intervals correspond to the ones of the standard linear model. The presented results are averages among replications. We observe that our estimator brings a substantial improvement in estimation precision with respect to the OLS estimator.

value p n
bias 0.025 100 0.127 0.301 0.278 0.671
variance 0.025 100 0.060 0.130 0.097 0.221
MSE 0.025 100 0.076 0.221 0.174 0.671
coverage 0.025 100 0.82 0.47 0.75 0.20
bias 0.01 1000 0.045 0.120 0.133 0.361
variance 0.01 1000 0.002 0.004 0.007 0.018
MSE 0.001 1000 0.004 0.018 0.025 0.148
coverage 0.001 1000 0.74 0.16 0.24 0.00
bias 0.001 10000 0.005 0.015 0.017
variance 0.001 10000 1.08 1.28 2.21 5.23
MSE 0.001 10000 1.33 3.53 5.10 3.772
coverage 0.001 10000 0.93 0.66 0.68 0.03
Table 1: . bias, variance, mean squared error (MSE) and coverage of confidence intervals for .


  • Alfons et al. (2013) Andreas Alfons, Christophe Croux, and Sarah Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, pages 226–248, 2013.
  • Belloni et al. (2011a) Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011a.
  • Belloni et al. (2011b) Alexandre Belloni, Victor Chernozhukov, et al. -penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130, 2011b.
  • Belloni et al. (2012) Alexandre Belloni, Daniel Chen, Victor Chernozhukov, and Christian Hansen. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012.
  • Belloni et al. (2014a) Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives, 28(2):29–50, 2014a.
  • Belloni et al. (2014b) Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014b.
  • Belloni et al. (2016) Alexandre Belloni, Victor Chernozhukov, Christian Hansen, and Damian Kozbur. Inference in high-dimensional panel models with an application to gun control. Journal of Business & Economic Statistics, 34(4):590–605, 2016.
  • Belloni et al. (2017) Alexandre Belloni, Victor Chernozhukov, Ivan Fernández-Val, and Christian Hansen.

    Program evaluation and causal inference with high-dimensional data.

    Econometrica, 85(1):233–298, 2017.
  • Bühlmann and Van De Geer (2011) Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • Chen et al. (2018) Mengjie Chen, Chao Gao, Zhao Ren, et al. Robust covariance and scatter matrix estimation under huber’s contamination model. The Annals of Statistics, 46(5):1932–1960, 2018.
  • Collier and Dalalyan (2017) Olivier Collier and Arnak S Dalalyan. Rate-optimal estimation of p-dimensional linear functionals in a sparse gaussian model. arXiv preprint arXiv:1712.05495, 2017.
  • Dalalyan (2012) Arnak S Dalalyan. Socp based variance free dantzig selector with application to robust estimation. Comptes Rendus Mathematique, 350(15-16):785–788, 2012.
  • Fan et al. (2017) Jianqing Fan, Quefeng Li, and Yuyan Wang. Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):247–265, 2017.
  • Gautier et al. (2011) Eric Gautier, Alexandre Tsybakov, and Christiern Rose. High-dimensional instrumental variables regression and confidence sets. arXiv preprint arXiv:1105.2454, 2011.
  • Giraud (2014) Christophe Giraud. Introduction to high-dimensional statistics. Chapman and Hall/CRC, 2014.
  • Hampel et al. (2011) Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
  • Huber et al. (1964) Peter J Huber et al. Robust estimation of a location parameter. The annals of mathematical statistics, 35(1):73–101, 1964.
  • Javanmard and Montanari (2014) Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
  • Lambert-Lacroix et al. (2011) Sophie Lambert-Lacroix, Laurent Zwald, et al. Robust regression through the huber?s criterion and adaptive lasso penalty. Electronic Journal of Statistics, 5:1015–1053, 2011.
  • Li (2012) Wei Li.

    Simultaneous variable selection and outlier detection using LASSO with applications to aircraft landing data analysis

    PhD thesis, Rutgers University-Graduate School-New Brunswick, 2012.
  • Maronna et al. (2018) Ricardo A Maronna, R Douglas Martin, Victor J Yohai, and Matías Salibián-Barrera. Robust statistics: theory and methods (with R). Wiley, 2018.
  • Rousseeuw and Leroy (2005) Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection, volume 589. John wiley & sons, 2005.
  • Van de Geer et al. (2014) Sara Van de Geer, Peter Bühlmann, Ya’acov Ritov, Ruben Dezeure, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
  • Vershynin (2018) Roman Vershynin.

    High-dimensional probability: An introduction with applications in data science

    , volume 47.
    Cambridge University Press, 2018.
  • Zhang and Zhang (2014) Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.

Appendix A Choice of the penalization parameter

a.1 Proof of Lemma 2.1

We start by proving the next two technical lemmas:

Lemma A.1

It holds that .

Proof. Because of the assumptions on the joint distribution of , we have that , therefore . Because , we obtain that , by the law of large numbers.

Lemma A.2

It holds that .

Proof. First, remark that, by the theorem of Pythagore,

Now, this leads to . Because and are i.i.d. and , we have that and . This implies that . We also have that , which leads to . We conclude by the continuous mapping theorem.

Now, we proceed with the proof of Lemma 2.1. Notice that

Using lemmas A.1 and A.2, we obtain


The rest of the lemma is a direct consequence of (7) and the pigeonhole principle.

a.2 Proof of Corollary 2.1

Proof of (i) By Lemma 2.1 it is sufficient to show that for ,

Let us remember the gaussian bound (see Lemma B.1 in Giraud (2014)): for , we have

Then, we have

Proof of (ii) By Lemma 2.1 it is sufficient to show that there exists such that

Let us remember the sub-gaussian bound (see Proposition 2.5.2 in Vershynin (2018)): for , there exists such that

Then, we have

Proof of (iii) By Lemma 2.1 it is sufficient to show that there exists such that

Let us remember the sub-exponential bound (see Proposition 2.7.1 in Vershynin (2018)): for , there exists such that

Then, we have, for large enough,

Appendix B Proof of lemma 2.2

b.1 Compatibility constant

For , we denote by the vector for which if and otherwise. Let us also define . We introduce the following cone:

We work with the following compatibility constant (see Bühlmann and Van De Geer (2011) for a discussion of the role of compatibility conditions in the lasso literature) corresponding to

We use the following lemma:

Lemma B.1

If , there exists such that w.p.a. .

Proof. Take , to show this result, notice that

Therefore, we have


where . Next, we have that

Now, because we have by the law of large numbers, we obtain that and that , both implying that. We conclude the proof using that .

b.2 End of the proof of Lemma 2.2

Throughout this proof, we work on the event