Outliers are a fundamental problem in statistical data analysis. Roughly speaking, an outlier is an observation point that differs from the data’s “global picture” . A rule of thumb is that a typical dataset may contain between 1% and 10% of outliers , or much more than that in specific applications such as web data, because of the inherent complex nature and highly uncertain pattern of users’ web browsing . This outliers problem was already considered in the early 50’s [15, 21] and it motivated in the 70’s the development of a new field called robust statistics [28, 29].
In this paper, we consider the problem of linear regression in the presence of outliers. In this setting, classical estimators, such as the least-squares, are known to fail 
. In order to conduct regression analysis in the presence of outliers, roughly two approaches are well-known. The first is based on detection and removal of the outliers to fit least-squares on “clean” data. Popular methods rely on leave-one-out methods (sometimes called case-deletion), first described in  with the use of residuals in linear regression. The main issue about these methods is that they are theoretically well-designed for the situations where only one given observation is an outlier. Repeating the process across all locations can lead to well-known masking and swamping effects . An interesting recent method that does not rely on a leave-one-out technique is the so-called IPOD 
, a penalized least squares method with the choice of tuning parameter relying on a BIC criterion. A second approach is based on robust regression, that considers loss functions that are less sensitive to outliers. This relies on the -estimation framework, that leads to good estimators of regression coefficients in the presence of outliers, thanks to the introduction of robust losses replacing the least-squares. However, the computation of -estimates is substantially more involved than that of the least-squares estimates, which to some extend counter-balance the apparent computational gain over previous methods. Moreover, robust regression focuses only on the estimation of the regression coefficients, and does not allow directly to localize the outliers, see also for instance  for a recent review.
Alternative approaches have been proposed to perform simultaneously outliers detection and robust regression. Such methods involve median of squares , S-estimation  and more recently robust weighted least-squares , among many others, see also  for a recent review on such methods. The development of robust methods intersected with the development of sparse inference techniques recently. Such inference techniques, in particular applied to high-dimensional linear regression, are of importance in statistics, and have been an area of major developments over the past two decades, with deep results in the field of compressed sensing, and more generally convex relaxation techniques [45, 9, 10, 12, 11]. These led to powerful inference algorithms working under a sparsity assumption, thanks to fast and scalable convex optimization algorithms . The most popular method allowing to deal with sparsity and variable selection is the LASSO , which is -penalized least-squares, with improvements such as the Adaptive LASSO , among a large set of other sparsity-inducing penalizations [7, 3].
Within the past few years, a large amount of theoretical results have been established to understand regularization methods for the sparse linear regression model, using so-called oracle inequalities for the prediction and estimation errors [30, 31, 33], see also [7, 20]
for nice surveys on this topic. Another line of works focuses on variable selection, trying to recover the support of the regression coefficients with a high probability[32, 31, 13]. Other types of loss functions  or penalizations [17, 5] have also been considered. Very recently, the sorted- norm penalization has been introduced [5, 6, 42] and very strong statistical properties have been shown. In particular, when covariates are orthogonal, SLOPE allows to recover the support of the regression coefficients with a control on the False Discovery Rate 
. For i.i.d covariates with a multivariate Gaussian distribution, oracle inequalities with optimal minimax rates have been shown, together with a control on a quantity which is very close to the FDR. For more general covariate distributions, oracle inequalities with an optimal convergence rate are obtained in .
However, many high-dimensional datasets, with hundreds or thousands of covariates, do suffer from the presence of outliers. Robust regression and detection of outliers in a high-dimensional setting is therefore important. Increased dimensionality and complexity of the data may amplify the chances of an observation being an outlier, and this can have a strong negative impact on the statistical analysis. In such settings, many of the aforementioned outlier detection methods do not work well. A new technique for outliers detection in a high-dimensional setting is proposed in , which tries to find the outliers by studying the behavior of projections from the data set. A small set of other attempts to deal with this problem have been proposed in literature [48, 37, 23, 39, 18], and are described below with more details.
2 Contributions of the paper
Our focus is on possibly high dimensional linear regression where observations can be contaminated by gross errors. This so-called mean-shifted outliers model can be described as follows:
for , where is the sample size. A non-zero means that observation is an outlier, and , , and respectively stand for the linear regression coefficients, vector of covariates, label and noise of sample
. For the sake of simplicity we assume throughout the paper that the noise is i.i.d centered Gaussian with known variance.
2.1 Related works
We already said much about the low-dimensional problem so we focus in this part on the high-dimensional one. The leave-one-out technique has been extended in  to high-dimension and general regression cases, but the masking and swamping problems remains. In other models, outliers detection in high-dimension also includes distance-based approaches  where the idea is to find the center of the data and then apply some thresholding rule. The model (2.1) considered here has been recently studied with LASSO penalizations  and hard-thresholding . LASSO was used also in , but here outliers are modeled in the variance of the noise. In [23, 39], that are closer to our approach, the penalization is applied differently: in , the procedure named Extended-LASSO uses two different penalties for and , with tuning parameters that are fixed according to theoretical results, while the IPOD procedure from  applies the same penalization to both vectors, with a regularization parameter tuned with a modified BIC criterion. In , error bounds and a signed support recovery result are obtained for both the regression and intercepts coefficients. However, these results require that the magnitude of the coefficients is very large, which is one of the issues that we want to overcome with this paper.
It is worth mentioning that model (2.1) can be written in a concatenated form , with being the concatenation of the covariates matrix (with lines given by the
’s) and the identity matrixin , and being the concatenation of and . This leads to a regression problem with a very high dimension for the vector . Working with this formulation, and trying to estimate directly is actually a bad idea. This point is illustrated experimentally in , where it is shown that applying two different LASSO penalizations on and leads to a procedure that outperforms the LASSO on the concatenated vector. The separate penalization is even more important in case of SLOPE, whose aim is FDR control for the support recovery of . Using SLOPE directly on would mix the entries of and together, which would make FDR control practically impossible due to the correlations between covariates in the matrix.
2.2 Main contributions
Given a vector with non-negative and non-increasing entries, we define the sorted- norm of a vector as
where . In  and  the sorted- norm was used as a penalty in the Sorted L-One Penalized Estimator (SLOPE) of coefficients in the multiple regression. Degenerate cases of SLOPE are -penalization whenever are all equal to a positive constant, and null-penalization if this constant is zero. We apply two different SLOPE penalizations on and , by considering the following optimization problem:
where and are positive parameters, is the covariates matrix with rows , , , is the Euclidean norm of a vector and and are two vectors with non-increasing and non-negative entries.
In this artice we provide the set of sequences and which allow to obtain better error bounds for estimation of and than previously known ones , see Section 3 below. Moreover, in Section 4 we provide specific sequences which, under some asymptotic regime, lead to a control of the FDR for the support selection of , and such that the power of the procedure (2.3) converges to one. Procedure (2.3) is therefore, to the best of our knowledge, the first proposed in literature to robustly estimate , estimate and detect outliers at the same time, with a control on the FDR for the multi-test problem of support selection of , and power consistency.
We compare in Section 5 our procedure to the recent alternatives for this problem, that is the IPOD procedure  and the Extended-Lasso . The numerical experiments given in Section 5 confirm the theoretical findings from Sections 3 and 4. As shown in our numerical experiments, the other procedures fail to guarantee FDR control or exhibit a lack of power when outliers are difficult to detect, namely when their magnitude is not far enough from the noise-level. It is particularly noticeable that our procedure overcomes this issue.
The theoretical results proposed in this paper are based on two popular assumptions in compressed sensing or other sparsity problems, similar to the ones from 
: first, a Restricted Eigenvalues (RE) condition on , then a mutual incoherence assumption  between and , which is natural since is excludes settings where the column spaces of and are impossible to distinguish. Proofs of results stated in Sections 3 and 4 are given in Section C and D, while preliminary results are given in Sections A and B. Section E provides contains supplementary extra numerical results.
3 Upper bounds for the estimation of and
Throughout the paper, is the sample size whereas is the number of covariables, so that . For any vector , , and denote respectively the number of non-zero coordinates of (also called sparsity), the -norm and the Euclidean norm. We denote respectively by and the smallest and largest eigenvalue of a symmetric matrix . We work under the following assumption
We assume the following sparsity assumption:
for some positive integers and , and we assume that the columns of are normalized, namely for , where stands for the -th element of the canonical basis.
For the results of this Section, we consider procedure (2.3) with the following choice of :
for , and we consider three possibilities for , corresponding to no penalization, penalization and SLOPE penalization on .
Table 1 below gives a quick view of the convergence rates of the squared estimation errors of and obtained in Theorems 3.4, 3.5 and 3.6. We give also the convergence rate obtained in  for penalization applied to and . In particular, we see that using two SLOPE penalizations leads to a better convergence rate than the use of penalizations.
Condition 3.2 below is a Restricted Eigenvalue (RE) type of condition which is adapted to our problem. Such an assumption is known to be mandatory in order to derive fast rates of convergence for penalizations based on the convex-relaxation principle .
Consider two vectors and with non-increasing and positive entries, and consider positive integers and . We define the cone of all vectors satisfying
We also define the cone of all vectors satisfying
We assume that there are constants with such that satisfies the following, either for all or for all :
which actually corresponds to a RE condition on and that Equation (3.5) is satisfied if satisfies a RE condition with constant . Finally, note that Equation (3.6), called mutual incoherence in the literature of compressed sensing, requires in this context that for all and from the respective cones the potential regression predictor is sufficiently not-aligned with potential outliers . An extreme case of violation of this assumption occurs when , where we cannot separate the regression coefficients from the outliers.
The Condition 3.2 is rather mild and e.g. for a wide range of random designs. Specifically, Theorem 3.3 below, shows that it holds with large probability whenever has i.i.d rows, with , and the vectors and are sufficiently sparse.
Let be a random matrix with i.i.d
be a random matrix with i.i.drows and . Let be the corresponding matrix with normalized columns. Given positive integers and , define . If
then there are such that for any , we have
with a probability greater than . These inequalities also hold for any when is replaced by in the above conditions.
The proof of Theorem 3.3 is given in Appendix C.1. It is based on recent bounds results for Gaussian random matrices . The numerical constants in Theorem 3.3 are far from optimal and chosen for simplicity so that as required in Assumption 3.2. A typical example for is the Toeplitz matrix with , for which is equal to . The required lower bound on is non-restrictive, since and correspond to the sparsity of and , that are typically much smaller than . Note also that will only be used in low dimension, and in this case is again much smaller than .
Let us define for the whole Section, with and defined in Assumption 3.2. The three theorem below and their proof are very similar in nature, but differ in some details, therefore are stated and proved separately. We emphasize that the proof give slightly more general versions of the theorems, allowing the same result with having any given support containing . This is of great theoretical interest and is a key point for the support detection of investigated in 4. The proof use a very recent bound on the inner product between a white Gaussian noise and any vector, involving the sorted norm . Our first result deals with linear regression with outliers and no sparsity assumption on . We consider procedure (2.3) with no penalization on , namely
where , is given by (2.2), and where
The proof of Theorem 3.6 is given in Appendix C.4. Note that according to Theorem 3.3, the assumptions of Theorem 3.6 are satisfied with probability converging to one when the rows of are i.i.d from the multivariate Gaussian distribution with the positive definite covariance matrix, and when the signal is sparse such that .
4 Asymptotic FDR control and power for the selection of the support of
We consider the multi-test problem with null-hypotheses
for , and we consider the multi-test that rejects whenever , where (and ) are given either by (3.7), (3.8) or (3.9). When is rejected, or “discovered”, we consider that sample is an outlier. Note however that in this case, the value of gives extra information on how much sample is oulying.
We use the FDR as a standard Type I error for this multi-test problem. The FDR is the expectation of the proportion of falses discoveries among all discoveries. Letting (resp. ) be the number of false rejections (resp. the number of rejections), the FDR is defined as
We use the Power to measure the Type II error for this multi-test problem. The Power is the expectation of the proportion of true discoveries. It is defined as
the Type II error is then given by .
For the linear regression model without outliers, a multi-test for the support selection of with controlled FDR based on SLOPE is given in  and . Specifically, it is shown that SLOPE with weights
for , where
is the cumulative distribution function ofand , controls FDR at the level in the multiple regression problem with orthogonal design matrix . It is also observed that when the columns of are not orthogonal but independent the weights have to be substantially increased to guarantee FDR control. This effect results from the random correlations between columns of and the shrinkage of true nonzero coefficients, and in context of LASSO have been thoroughly discussed in .
In this paper we substantially extend current results on FDR controlling properties of SLOPE. Specifically, Theorem 4.1 below gives asymptotic controls of and for the procedures (3.7), (3.8) and (3.9), namely different penalizations on and SLOPE applied on , with slightly increased weights
where , see also . This choice of also yields optimal convergence rates, however considering it in Section 3 would lead to some extra technical difficulties. Under appropriate assumptions on , the signal sparsity and the magnitude of outliers, Theorem 4.1 not only gives FDR control, but also proves that the Power is actually going to .
Note that all asymptotics considered here are with respect to the sample size , namely the statement means that with .
Suppose that there is a constant such that the entries of satisfy for all , and suppose that
Then, the following properties hold:
The proof of Theorem 4.1 is given in Appendix D. It relies on a careful look at the KKT conditions, also known as the dual-certificate method  or resolvent solution . The assumptions of Theorem 4.1 are natural. The boundedness assumption on the entries of are typically satisfied with a large probability when is Gaussian for instance. When , it is also natural to assume that (let us recall that stands for the sparsity of the sample outliers ). The asymptotic assumptions roughly ask for the rates in Table 1 to converge to zero. Finally, the assumption on the magnitude of the non-zero entries of is somehow unavoidable, since it allows to distinguish outliers from the Gaussian noise. We emphasize that good numerical performances are actually obtained with lower magnitudes, as illustrated in Section 5.2.
5 Numerical experiments
In this section, we illustrate the performance of procedure (3.7) and procedure (3.9) both on simulated and real-word datasets, and compare them to several state-of-the art baselines described below. Experiments are done using the open-source tick library, available at https://x-datainitiative.github.io/tick/, notebooks allowing to reproduce our experiments are available on demand to the authors.
5.1 Considered procedures
We consider the following baselines, featuring the best methods available in literature for the joint problem of outlier detection and estimation of the regression coefficients, together with the methods introduced in this paper.
This is Extended LASSO from , that uses two dedicated -penalizations for and with respective tuning parameters and .
This is (soft-)IPOD from . It relies on a nice trick, based on a decomposition of . Indeed, write and let be formed by column vectors completing the column vectors of into an orthonormal basis, and introduce . Model (2.1) can be rewritten as , a new high-dimensional linear model, where only is to be estimated. The IPOD then considers the Lasso procedure applied to this linear model, and a BIC criterion is used to choose the tuning parameter of
-penalization. Note that this procedure, which involves a QR decomposition of, makes sense only for significantly smaller than , so that we do not report the performances of IPOD on simulations with a large .
Same as IPOD but with tuning parameter for the penalization of individual intercepts chosen by cross-validation. As explained above, cross-validation is doomed to fail in the considered model, but results are shown for the sake of completeness.
It is SLOPE applied to the concatenated problem, namely , where is the concatenation of and and is the concatenation of and . We use a single SLOPE penalization on , with weights equal to (4.3). We report the performances of this procedure only in high-dimensional experiments, since it always penalizes . This is considered mostly to illustrate the fact that working on the concatenated problem is indeed a bad idea, and that two distinct penalizations must be used on and .
Note that the difference between IPOD and E-LASSO is that, as explained in , the weights used for E-LASSO to penalize (and in high-dimension) are fixed, while the weights in IPOD are data-dependent. Another difference is that IPOD do not extend well to a high-dimensional setting, since its natural extension (considered in ) is a thresholding rule on the the concatenated problem, which is, as explained before, and as illustrated in our numerical experiments, poorly performing. Another problem is that there is no clear extension of the modified BIC criterion proposed in  for high-dimensional problems.
The tuning of the SLOPE or penalizations in the procedure described above require the knowledge of the noise level. We overcome this simply by plugging in (4.3) or wherever it is necessary a robust estimation of the variance: we first fit a Huber regression model, and apply a robust estimation of the variance of its residuals. All procedures considered in our experiments use this same variance estimate.
The noise level can be estimated directly by the Huber regression since in our simulations . When is comparable or larger than and the signal (both and ) is sufficiently sparse one can jointly estimate the noise level and other model parameters in the spirit of scaled LASSO . The corresponding iterative procedure for SLOPE was proposed and investigated in  in the context of high-dimensional regression with independent regressors. However, we feel that the problem of selection of the optimal estimator of in ultra-high dimensional settings still requires a separate investigation and we postpone it for a further research.
5.2 Simulation settings
The matrix is simulated as a matrix with i.i.d row distributed as , with Toeplitz covariance for , with moderate correlation . Some results with higher correlation are given in Appendix E. The columns of are normalized to . We simulate observations according to model (2.1) with and . Two levels of magnitude are considered for : low-magnitude, where and large-magnitude, where . In all reported results based on simulated datasets, the sparsity of varies between to , and we display the averages of FDR, MSE and power over 100 replications.
Setting 1 (low-dimension)
This is the setting described above with . Here .
Setting 2 (high-dimension)
This is the setting described above with and a sparse with sparsity , with non-zero entries chosen uniformly at random.
5.4 Results and conclusions on simulated datasets
In the low dimensional setting, our procedure
allows for almost perfect FDR control and its MSE is the smallest among all considered methods. Note that in this setting the MSE is plotted after debiasing the estimators, performing ordinary least squares on the selected support.
In the sparse (on ) high dimensional setting with correlated regressors, E-SLOPE allows to keep FDR below the nominal level even when the outliers consist 50% of the total data points. It also allows to maintain a small MSE and high power. The only procedure that improves E-SLOPE in terms of MSE for is SLOPE in Figure 3, at the cost of a worse FDR control.
E-SLOPE provides a massive gain of power compared to previous state-of-the-art procedures (power is increased by more than ) in settings where outliers are difficult to detect.
5.5 PGA/LPGA dataset
This dataset contains Distance and Accuracy of shots, for PGA and LPGA players in 2008. This will allow us to visually compare the performance of IPOD, E-LASSO and E-SLOPE. Our data contain 197 points corresponding to PGA (men) players, to which we add 8 points corresponding to LPGA (women) players, injecting outliers. We apply SLOPE and LASSO on with several levels of penalization. This leads to the “regularization paths” given in the top plots of Figure 4, that shows the value of the 205 sample intercepts as a function of the penalization level used in SLOPE and LASSO. Vertical lines indicate the choice of the parameter according the corresponding method (E-SLOPE, E-LASSO, IPOD). We observe that E-SLOPE correctly discover the confirmed outliers (women data), together with two men observations that can be considered as outliers in view of the scatter plot. IPOD procedure does quite good, with no false discovery, but misses some real outliers (women data) and the suspicious point detected by E-SLOPE. E-LASSO does not make any false discovery but clearly reveals a lack of power, with only one discovery.
5.6 Retail Sales Data
This dataset is from the U.S. census Bureau, for year 1992. The informations contained in it are the per capita retail sales of 845 US counties (in $1000s). It also contains five covariates: the per capita retail establishments, the per capita income (in $1000s), per capita federal expenditures (in $1000s), and the number of males per 100 females. No outliers are known, so we artificially create outliers by adding a small amount (magnitude , random sign) to the retail sales of counties chosen uniformly at random. We consider various scenarii (from to of outliers) and compute the false discovery proportion and the power. Figure 5 below summarizes the results for the three procedures.
The results are in line with the fact that E-SLOPE is able to discover more outliers than its competitors. E-SLOPE has the highest power, and the FDP remains under the target level.
5.7 Colorectal Cancer Data
We consider whole exome sequencing data for 47 primary colorectal cancer tumors, characterized by a global genomic instability affecting repetitive DNA sequences (also known as microsatellite instable tumors, see ). In what follows, we restrict ourselves to repetitive sequences whose base motif is the single nucleotide A, and which are in regulatory regions (following the coding regions) that influence gene expression (UTR3). Same analysis could be run with different base motifs and different regions (exonic, intronic). It has been shown in recent publications (see ), that the probability of mutation of a sequence is dependent of the length of the repeat. So we fit, after a rescaled probit transformation, our mean-shift model with an intercept and the length of the repeat as covariates. The aim of the analysis is to find two categories of sequences: survivors (multi-satellites that mutated less than expected) and transformators (multi-satellites that mutated more than expected), with the idea that those sequences must play a key role in the cancer development.
We fix the FDR level , results are shown in Figure 6: blue dots are the observed mutation frequencies of each gene among the 47 tumors, plotted as a function of the repeat length of the corresponding gene, our discoveries are hightlighted in red.
We made 37 discoveries, and it is of particular interest to see that our procedure select both ”obvious” outliers and more ”challenging” observations that are discutable with the unaided eye. We also emphasize that with the IPOD procedure and the LASSO procedure described in the previous paragraph, respectively 32 and 22 discoveries were made, meaning that even with this stringent FDR level, our procedure allow us to make about more discoveries than IPOD.
This paper introduces a novel approach for simultaneous robust estimation and outliers detection in the linear regression model. Three main results are provided: optimal bounds for the estimation problem in Section 3, that improve in particular previous results obtained with LASSO penalization , and asymptotic FDR control and power consistency for the outlier detection problem in Section 4. To the best of our knowledge, this is the first result involving FDR control in this context.
Our theoretical foundings are confirmed on intensive experiments both on real and synthetic datasets, showing that our procedure outperforms existing procedure in terms of power, while maintaining a control on the FDR, even in challenging situations such as low-magnitude outliers, a high-dimensional setting and highly correlated features.
Appendix A Technical inequalities
The following technical inequalities are borrowed from , where proofs can be found. Let be positive integers. In the following lemma, an inequality for the sorted -norm (defined in equation 2.2) is stated.
For any two , and any such that we have
The following lemma gives a preliminary bound for the prediction error in our context, that are the starting point of our proof.
Let be a convex function. Consider a design matrix , a vector and define where . If is a solution of the minimization problem , then satisfies:
Because the proof in  is more general, we give a proof adapted to our context. Optimality of allows to choose in the subdifferential of sucht that
Now, by definition of subdifferential, . Combining this inequality with the previous equality leads to the conclusion. ∎
The following lemma allows to bound the inner product between a white Gaussian noise and any vector. The resulting bound involved the sorted norm.
Let and let with columns normed to . If is distributed, then the event
is of probability at least , where
Appendix B Results related to Gaussian matrices
Inequalities for Gaussian random matrices are needed in what follows. They are stated here for the sake of clarity and we refer the reader to  for proof (except bounds B.4 and B.5 that are taken from Lemma 1 in ). Again, and denote positive integers.
Let with i.i.d rows. Denote by the largest singular value of
the largest singular value of. Then, for all ,
Let be distributed. Then for all :
Let be independent and distributed. Then for all :
Let be distributed. Then, for all :
The following recent result (, Theorem 1) will also be useful.
Let with i.i.d rows. There exists positive constants and such that with probability greater than , we have for all :
where is the lowest eigenvalue of .
Appendix C Proof of Section 3
This section is devoted to the proof of our main results, stated in Section 3.
c.1 Proof of Theorem 3.3
Define the diagonal matrix such that ( is the diagonal matrix formed by the inverse of the norm of each column of ). Applying now Lemma B.3 for and we obtain for all
with probability greater than , where and denote respectively the maximum and minimum of the norms of the columns of . Note that for all , the squared norm of the column of is distributed, so using the bounds B.4 and B.5 of Lemma B.2 (respectively with and ), together with a union bound we obtain that with probability greater than
and we eventually obtain
Let define (see Definition 3.3). Then,
Thus we obtain
and thus, using the fact that ,
Now if , Equation (C.1) together with the inequality lead to
and we conclude as above. Thus the first part of the theorem is satisfied.
Now, we must lower bound the scalar product .
Divide with () of cardinality containing the support of the largest absolute values of and of cardinality the support of the remaining values. Divide in the same way (of cardinalitys ) with respect to the largest absolute values of ( and to be chosen later). We use this to lower bound the scalar product: