# High Dimensional Robust Sparse Regression

We provide a novel -- and to the best of our knowledge, the first -- algorithm for high dimensional sparse regression with corruptions in explanatory and/or response variables. Our algorithm recovers the true sparse parameters in the presence of a constant fraction of arbitrary corruptions. Our main contribution is a robust variant of Iterative Hard Thresholding. Using this, we provide accurate estimators with sub-linear sample complexity. Our algorithm consists of a novel randomized outlier removal technique for robust sparse mean estimation that may be of interest in its own right: it is orderwise more efficient computationally than existing algorithms, and succeeds with high probability, thus making it suitable for general use in iterative algorithms. We demonstrate the effectiveness on large-scale sparse regression problems with arbitrary corruptions.

## Authors

• 53 publications
• 7 publications
• 6 publications
• 43 publications
• ### High Dimensional Robust Estimation of Sparse Models via Trimmed Hard Thresholding

We study the problem of sparsity constrained M-estimation with arbitrary...
01/24/2019 ∙ by Liu Liu, et al. ∙ 0

• ### Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering

We study high-dimensional sparse estimation tasks in a robust setting wh...
11/19/2019 ∙ by Ilias Diakonikolas, et al. ∙ 26

• ### Robust High Dimensional Sparse Regression and Matching Pursuit

We consider high dimensional sparse regression, and develop strategies a...
01/12/2013 ∙ by Yudong Chen, et al. ∙ 0

• ### Computationally Efficient Robust Estimation of Sparse Functionals

Many conventional statistical procedures are extremely sensitive to seem...
02/24/2017 ∙ by Simon S. Du, et al. ∙ 0

• ### Interaction Hard Thresholding: Consistent Sparse Quadratic Regression in Sub-quadratic Time and Space

Quadratic regression involves modeling the response as a (generalized) l...
11/08/2019 ∙ by Shuo Yang, et al. ∙ 0

• ### Efficient Background Modeling Based on Sparse Representation and Outlier Iterative Removal

Background modeling is a critical component for various vision-based app...
01/23/2014 ∙ by Linhao Li, et al. ∙ 0

• ### Being Robust (in High Dimensions) Can Be Practical

Robust estimation is much more challenging in high dimensions than it is...
03/02/2017 ∙ by Ilias Diakonikolas, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Learning in the presence of arbitrarily (even adversarially) corrupted outliers in the training data has a long history in Robust Statistics [25, 22, 39], and has recently received much renewed attention. The high dimensional setting poses particular challenges as outlier removal via preprocessing is essentially impossible when the number of variables scales with the number of samples. We propose a computationally efficient estimator for outlier-robust sparse regression that has near-optimal sample complexity, and is the first algorithm resilient to a constant fraction of arbitrary outliers with corrupted covariates and/or response variables. Unless we specifically mention otherwise, all future mentions of outliers mean corruptions in covariates and/or response variables.

We assume that the authentic samples are independent and identically distributed (i.i.d.) drawn from an uncorrupted distribution , where represents the linear model , where , and is the true parameter (see Section 1.3 for complete details and definitions). To model the corruptions, the adversary can choose an arbitrary -fraction of the authentic samples, and replace them with arbitrary values. We refer to the observations after corruption as -corrupted samples (Definition 1.1). This corruption model allows the adversary to select an -fraction of authentic samples to delete and corrupt, hence it is stronger than Huber’s -contamination model [24], where the adversary independently corrupts each sample with probability .

Outlier-robust regression is a classical problem within robust statistics (e.g., [36]), yet even in the low-dimensional setting, efficient algorithms robust to corruption in the covariates have proved elusive, until recent breakthroughs in [35, 14] and [28], which built on important results in Robust Mean Estimation [12, 30] and Sums of Squares [3], respectively.

In the sparse setting, the parameter we seek to recover is also -sparse, and a key goal is to provide recovery guarantees with sample complexity scaling with , and sublinearly with . Without outliers, by now classical results (e.g., [17]) show that samples are enough to give recovery guarantees on with and without additive noise. In the additive noise setting, exact recovery of with a constant -fraction of arbitrary outliers does not seem to be possible. Instead, we seek to give an efficient, sample-complexity optimal algorithm that recovers to within accuracy depending on , the fraction of outliers. In the case of no additive noise, we seek algorithms that can guarantee exact recovery, independent of .

### 1.1 Related work

The last 10 years have seen a resurgence in interest in robust statistics, including the problem of resilience to outliers in the data. Important problems attacked have included PCA [29, 44, 43, 30, 12], and more recently robust regression (as in this paper) [35, 14, 16, 28] and robust mean estimation [12, 30, 1], among others. We focus now on the recent work most related to the present paper.

#### Robust regression.

Earlier work in robust regression considers corruption only in the output, and shows that algorithms nearly as efficient as for regression without outliers succeeds in parameter recovery, even with a constant fraction of outliers [31, 34, 5, 6]. Yet these algorithms (and their analysis) focus on corruption in , and do not seem to extend to the setting of corrupted covariates – the setting of this work.

In the low dimensional setting, there has been remarkable recent progress. The work in [28] shows that the Sum of Squares (SOS) based semidefinite hierarchy can be used for solving robust regression. In fact, their results apply to a corruption model that allows the adversary to select an -fraction of points to corrupt, and hence is stronger than Huber’s -contamination model (where the corrupted fraction is chosen at random). Essentially concurrent to the SOS work, [35, 14] develop a framework for robust gradient descent for empirical risk minimization, by using robust mean estimation as a subroutine to compute robust gradients at each iteration. Though this latter work applies to Huber’s -contamination model, on the other hand, computationally it scales better than the algorithms in [28], as although the Sum of Squares SDP framework gives polynomial time algorithms, they are often not practical [23].

Much less appears to be known in the high-dimensional regime. Work in [11] first analyzed high dimensional sparse regression with arbitrary corruptions in covariates. They show that replacing the standard inner product in Matching Pursuit with a trimmed version, one can recover from an -fraction of outliers, with . [20] considered robust sparse regression under the Huber -contamination model by optimizing Tukey depth [39, 10], Their results reveal that handling a constant fraction of outliers () is actually minimax-optimal. However, computing the Tukey depth takes exponential time [27]. Very recently, [33] considered more general sparsity constrained -estimation by using a trimmed estimator in each step of gradient descent, yet the robustness guarantee is still sub-optimal.

The more recent work in low-dimension regression discussed above does not shed light on the sparse (high dimensional) setting. In particular, it is not clear how to extend the techniques in [35] and [14, 16] to the sparse case, while maintaining correctly scaling sample complexity, as they all require sample complexity at least . Furthermore, [16] stated a quadratic tradeoff in sample complexity that any polynomial time statistical query (SQ) algorithm for low dimensional robust regression with unknown covariance requires at least samples to achieve minimax-optimal error , where

is the standard deviation of the random noise.

Another approach follows as a byproduct of a recent algorithm for robust sparse mean estimation, in [1]. However, their error guarantee scales with , and moreover, does not provide exact recovery

in the adversarial corruption case without stochastic noise (i.e., noise variance

). We note that this is an inevitable consequence of their approach, as they directly use sparse mean estimation on , rather than considering Maximum Likelihood Estimation.

#### Robust mean estimation.

The idea in [35, 14] is to leverage recent breakthroughs in robust mean estimation. Very recently, [12, 30] provided the first robust mean estimation algorithms that can handle a constant fraction of outliers. Following their work, [1] extended the ellipsoid algorithm from [12] to robust sparse mean estimation in high dimensions. They show that -sparse mean estimation in with a constant fraction of outliers can be done with samples. The term appears to be necessary, as follows from an oracle-based lower bound [15].

### 1.2 Main contributions

The main results, and our innovations in the context of the above body of work, are as follows.

• We provide an algorithm that is resilient to a small but constant fraction, , of arbitrary outliers (corruptions in the covariates, and/or labels). Our algorithm requires samples. The extra factor of in the term is not information-theoretically optimal, and is due to our use of sparse PCA as a subroutine. The term is not likely improvable for sparse PCA (see, e.g., [4]), and we conjecture it is not improvable for this problem either.111We note that results in [15] show a statistical query-based lower bound of sample complexity on robust sparse mean estimation. Our result is based on a robust variant of Iterative Hard Thresholding (IHT) [8]; specifically, we provide a stability result showing that IHT works with any robust sparse mean estimation algorithm. This is the content of Theorem 2.1.

• Our result can be viewed as a meta-theorem, as it allows the use of any sparse mean estimation subroutine that can provide guarantees with high probability. Then, these guarantees provided carry over to the full IHT-based algorithm. Corollary 3.1 uses a robust sparse mean estimation subroutine based on a version of the ellipsoid algorithm, given and analyzed in [1]. The proof of its performance in [1] hinges on obtaining an upper bound on the sparse operator norm (their Lemmas A.2 and A.3). As we show via counterexample (see Appendix B), the statement of Lemma A.3 seems to be incorrect, and the general approach of upper bounding the sparse operator norm may not work. Nevertheless, the algorithm performance they claim is correct, as we show through a quite different avenue (see Lemma D.3 in Section D.3).

Using this ellipsoid algorithm, our results show that given -corrupted sparse regression samples with identity covariance, we recover within minimax optimal additive error . (We note that this is a stronger guarantee than the low-dimensional robust regression result in [35, 14], which has a guarantee.) In particular, we obtain exact recovery if either the fraction of outliers goes to zero (this is just ordinary sparse regression), or in the presence of a constant fraction of outliers but with the additive noise term going to zero (this is the case of robust sparse linear equations). To the best of our knowledge, this is the first result that shows exact recovery for robust sparse linear equations with a constant fraction of outliers. This is the content of Section 3.

• We then prove a result that may be of interest in its own right: we provide a novel robust sparse mean estimation algorithm that is based on a filtering algorithm for sequentially screening and removing potential outliers, rather than on the ellipsoid algorithm. The filtering algorithm is faster by at least a quadratic factor in the dimension (), and also provides exponential probability guarantees, unlike, e.g., the filtering-style algorithm of [13]. This sharper concentration requires a more careful Martingale-bound analysis. This is the content of Theorem 4.1. Combining this with Theorem 2.1 gives us a more computationally efficient algorithm for robust sparse regression with the same sample complexity . This speedup comes at a cost, as our final result guarantees recovering within an additive error of in the identity covariance case (similarly to [35, 14]). Nevertheless, the result is strong enough to guarantee exact recovery when either or goes to zero. We demonstrate the practical effectiveness of our filtering algorithm in Appendix H.

• For robust sparse regression with unknown covariance matrix, we consider the wide class of sparse covariance matrices [7], where we assume that the covariance matrix is only known to be row/column sparse, but the positions of the non-zero entries are unknown. Based on the filtering algorithm proposed in Section 4, we can guarantee recovery of within an additive error of . We note that our error bound matches SQ-based lower bounds [16] which essentially say that for unknown covariance, even in the low-dimensional setting, an additive error of is inevitable unless sample complexity exceeds . We achieve this error bound with sample complexity sub-linear in , thanks to our structured covariance assumption. This is the content of Section 5.

### 1.3 Setup, Notation and Outline

In this subsection, we formally define the corruption model and the sparse regression model. We first introduce the -corrupted samples described above:

###### Definition 1.1 (ϵ-corrupted samples).

Let be i.i.d. observations follow from a distribution . The -corrupted samples are generated by the following process: an adversary chooses an arbitrary -fraction of the samples in and modifies them with arbitrary values. After the corruption, we use to denote the observations, and use to denote the corruptions.

The parameter represents the fraction of outliers. Throughout, we assume that it is a (small) constant, independent of dimension or other problem parameters. Furthermore, we assume that the distribution is the standard Gaussian-design AWGN linear model.

###### Model 1.1.

The observations follow from the linear model , where is the model parameter, and assumed to be -sparse. We assume that and , where is the normalized covariance matrix with for all . We denote

as the smallest eigenvalue of

, and as its largest eigenvalue. They are assumed to be universal constants in this paper, and we denote the constant .

As in [12], we pre-process via a naive pruning step, removing “obvious” outliers; we henceforth assume that all authentic and corrupted points are within a radius bounded by a polynomial in , and .

Notation. We denote the hard thresholding operator of sparsity by . We define the -sparse operator norm as , where is not required to be positive-semidefinite (p.s.d.). We define the infinity norm for a matrix as . Given index set ,

is the vector restricted to indices

. Similarly, is the sub-matrix on indices . We let denote the Kronecker product, and for a vector , we denote the outer product by . We use trace inner produce to denote . We use

to denote the expectation operator obtained by the uniform distribution over all samples

in a set , and to denote expectation operator obtained by drawing samples from the distribution . In proofs, we use to denote constants that are independent of dimension, but whose value can change from line to line. Finally, we use the notation to hide the dependency on , and to hide the dependency on in our bounds.

## 2 Hard thresholding with robust gradient estimation

In this section, we present our method of using robust sparse gradient updates in IHT. We then show statistical recovery guarantees given any accurate robust sparse gradient estimation, which is formally defined in Definition 2.1.

### 2.1 Methodology

We define the notation for the stochastic gradient corresponding to the point , and the population gradient for based on Model 1.1,

 gti=xi(x⊤iβt−yi), and Gt=Ezi∼P(gti),

where is the distribution of the authentic points. Since , the population mean of all authentic gradients is given by

 Gt=Ezi∼P(xix⊤i(βt−β∗))=Σ(βt−β∗). (1)

In the uncorrupted case where all samples follow from Model 1.1, a single iteration of IHT updates via

 βt+1=Pk′(βt−Ei∈uGgti). (2)

Here, the hard thresholding operator selects the largest elements in magnitude, and the parameter is proportional to (specified in Theorem 2.1).

However, given -corrupted samples according to Definition 1.1, the IHT update eq. 2 based on empirical average of all gradient samples can be arbitrarily bad.

The key goal in this paper is to find a robust estimate to replace in each step of IHT, with sample complexity sub-linear in the dimension . For instance, we consider robust sparse regression with . Then, by eq. 1, is guaranteed to be -sparse in each iteration of IHT. In this case, given -corrupted samples, we can use a robust sparse mean estimator to recover the unknown true from , with sub-linear sample complexity.

More generally, we propose Robust Sparse Gradient Estimator (RSGE) for gradient estimation given -corrupted samples, as defined in Definition 2.1, which guarantees that the deviation between the robust estimate and true , with sample complexity . For a fixed -sparse parameter , we drop the superscript without abuse of notation, and use in place of , and in place of ; denotes the population gradient over the authentic samples’ distribution , at the point .

###### Definition 2.1 (ψ(ϵ)-Rsge).

Given -corrupted samples from Model 1.1, we call a -RSGE, if given , guarantees

 ∥∥ˆG(β)−G(β)∥∥22≤α(ϵ)∥G(β)∥22+ψ(ϵ),

with probability at least .

Here, we use to denote the sample complexity as a function of , and note that the definition of RSGE does not require

to be identity matrix. The parameters

and will be specified by concrete robust sparse mean estimators in subsequent sections. Equipped with Definition 2.1, we propose Algorithm 1, which takes any RSGE as a subroutine in line 8, and runs a robust variant of IHT with the estimated sparse gradient at each iteration in line 9.222Our results require sample splitting to maintain independence between subsequent iterations, though we believe this is an artifact of our analysis. Similar approach has been used in [2, 35] for theoretical analysis. We do not use sample splitting technique in the experiments.

### 2.2 Global linear convergence and parameter recovery guarantees

In each single IHT update step, RSGE introduces a controlled amount of error. Theorem 2.1 gives a global linear convergence guarantee for Algorithm 1 by showing that IHT does not accumulate too much error. In particular, we are able to recover within error given any -RSGE subroutine.

###### Theorem 2.1 (Meta-theorem).

Suppose we observe -corrupted samples from Model 1.1. Algorithm 1, with -RSGE defined in Definition 2.1, with step size outputs , such that

 ∥∥ˆβ−β∗∥∥2=O(√ψ(ϵ)),

with probability at least , by setting and . The sample complexity is .

We give the proof of Theorem 2.1 in Appendix A. The hyper-parameter guarantees global linear convergence of IHT when (when ). This setup has been used in [26, 37], and is proved to be necessary in [32].

## 3 Robust sparse regression with near-optimal guarantee

In this section, we provide near optimal statistical guarantee for robust sparse regression when the covariance matrix is identity. Under the assumption , [1] proposes a robust sparse regression estimator based on robust sparse mean estimation on , leveraging the fact that . With sample complexity , this algorithm produces a such that, with probability at least ,

 ∥∥˜β−β∗∥∥22=˜O(ϵ2(∥β∗∥22+σ2)). (3)

Using Theorem 2.1, we show that we can obtain significantly stronger statistical guarantees which are statistically optimal; in particular, our guarantees are independent of and yield exact recovery when .

### 3.1 RSGE via ellipsoid algorithm

More specifically, the ellipsoid-based robust sparse mean estimation algorithm deals with outliers by trying to optimize the set of weights on each of the samples in – ideally outliers would receive lower weight and hence their impact would be minimized. Since the set of weights is convex, this can be approached using a separation oracle Algorithm 2. The Algorithm 2 depends on a convex relaxation of Sparse PCA, and the hard thresholding parameter is , as the population mean of all authentic gradient samples is guaranteed to be -sparse. In line 6 of Algorithm 2, with each call to the relaxation of Sparse PCA, we obtain an optimal value, , and optimal solution, , to the problem:

 λ∗=maxHTr((ˆΣ−F(ˆG))⋅H),% subject to H≽0,∥H∥1,1≤˜k,Tr(H)=1. (4)

Here, ,

are weighted first and second order moment estimates from

-corrupted samples, and is a function with closed-form

 F(ˆG)=∥ˆG∥22Id+ˆGˆG⊤+σ2Id. (5)

For eq. 5, given the population mean , we have , which calculates the underlying true covariance matrix. We provide more details about the calculation of , as well as some smoothness properties, in Appendix C.

The key component in the separation oracle Algorithm 2 is to use convex relaxation of Sparse PCA eq. 4. This idea generalizes existing work on using PCA to detect outliers in low dimensional robust mean estimation [12, 30]. To gain some intuition for eq. 4, if is an outlier, then the optimal solution of eq. 4, , may detect the direction of this outlier. And this outlier will be down-weighted in the output of Algorithm 2 by the separating hyperplane. Finally, Algorithm 2 will terminate with and output the robust sparse mean estimation of the gradients , which appears in line 9 in Algorithm 1.

Indeed, the ellipsoid-algorithm-based robust sparse mean estimator gives a RSGE, which we can combine with Theorem 2.1 to obtain stronger results. We state these as Corollary 3.1. We note again that the analysis in [1] has a flaw. Their Lemma A.3 is incorrect, as our counterexample in Appendix B demonstrates. We provide a correct route of analysis in Lemma D.3 of Appendix D.

### 3.2 Near-optimal statistical guarantees

###### Corollary 3.1.

Suppose we observe -corrupted samples from Model 1.1 with . By setting , if we use the ellipsoid algorithm for robust sparse gradient estimation with , it requires samples, and guarantees . Hence, Algorithm 1 outputs , such that

 ∥∥ˆβ−β∗∥∥2=˜O(σϵ),

with probability at least , by setting .

For a desired error level , we only require sample complexity . Hence, we can achieve statistical error . Our error bound is nearly optimal compared to the information-theoretically optimal in [20], as the term is necessary by an oracle-based SQ lower bound [15].

###### Proof sketch of Corollary 3.1.

The key to the proof relies on showing that controls the quality of the weights of the current iteration, i.e., small means good weights and thus a good current solution. Showing this relies on using to control . Lemma A.3 in [1] claims that . As we show in Appendix B, however, this need not hold. This is because the trace norm maximization eq. 4 is not a valid convex relaxation for the -sparse operator norm when the term is not p.s.d. (which indeed it need not be). We provide a different line of analysis in Lemma D.3, essentially showing that even without the claimed (incorrect) bound, can still provide the control we need. With the corrected analysis for , the ellipsoid algorithm guarantees with probability at least . Therefore, the algorithm provides an -RSGE. ∎

From a computational viewpoint, the time complexity of Algorithm 1 depends on the RSGE in each iterate. The time complexity of the ellipsoid algorithm is indeed polynomial in the dimension, but it requires calls to a relaxation of Sparse PCA ([9]). In the next section, we introduce a faster algorithm as a RSGE, which only requires calls of Sparse PCA (recall that only scales with ).

## 4 Computationally efficient robust sparse mean estimation

In this section, we propose a different robust sparse mean estimator which only requires Sparse PCA calls. We provide our theoretical results when we have identity covariance, and use the shorthand . Similar to the ellipsoid algorithm, the core ingredient is (a relaxation of) Sparse PCA. We use calls to Sparse PCA to discard samples, and hence the number of calls we need is linear in the sample size; in contrast, the ellipsoid algorithm updates each sample’s weight via an optimization approach that uses Sparse PCA as a separation oracle, and hence it requires at least calls to the separation oracle [9].

Our proposed RSGE (Algorithm 3) attempts to remove one outlier at each iteration, as long as a good solution has not already been identified. It first estimates the gradient by hard thresholding (line 4) and then estimates the corresponding sample covariance matrix (line 5). By solving (a relaxation of) Sparse PCA, we obtain a scalar as well as a matrix . If is smaller than the predetermined threshold , we have a certificate that the effect of the outliers is well-controlled (specified in eq. 9). Otherwise, we compute a score for each sample based on

, and discard one of the samples according to a probability distribution where each sample’s probability of being discarded is proportional to the score we have computed

333Although we remove one sample in Algorithm 3, our theoretical analysis naturally extend to removing constant number of outliers. This fasten the algorithm in practice, yet shares the same computational complexity.

Algorithm 3 can be used for other robust sparse functional estimation problems (e.g., robust sparse mean estimation for , where is -sparse).

### 4.1 More computationally efficient RSGE

To use Algorithm 3 as a RSGE given gradient samples (denoted as ), we call Algorithm 3 repeatedly on and then on its output, , until it returns a robust estimator . The next theorem provides guarantees on this iterative application of Algorithm 3.

###### Theorem 4.1.

Suppose we observe -corrupted samples from Model 1.1 with . Let be an -corrupted set of gradient samples . By setting , if we run Algorithm 3 iteratively with initial set , and subsequently on , and use , 444Similar to [12, 13, 14, 1], our results seem to require this side information. then this repeated use of Algorithm 3 will stop after at most iterations, and output , such that

 ∥∥ˆGt−Gt∥∥22=˜O(ϵ(∥∥Gt∥∥22+σ2)),

with probability at least . Here, is a constant depending on , where is a constant.

Thus, Theorem 4.1 shows that with high probability, Algorithm 3 provides a Robust Sparse Gradient Estimator where . For example, we can take . We note that this significantly improves the probability bound in [14]. Combining now with Theorem 2.1, we obtain an error guarantee for robust sparse regression.

###### Corollary 4.1.

Suppose we observe -corrupted samples from Model 1.1 with . Under the same setting as Theorem 4.1, if we use Algorithm 3 for robust sparse gradient estimation, it requires samples, and , then we have,

 ∥∥ˆβ−β∗∥∥2=˜O(σ√ϵ) (8)

with probability at least .

Similar to Section 3, we can achieve statistical error . The scaling of in eq. 8 is , which is the same as in [35, 14]

for low dimensional robust linear regression. These guarantees are worse than

achieved by ellipsoid methods. Nevertheless, this result is strong enough to guarantee exact recovery when either or goes to zero.

The key step in Algorithm 3 is outlier removal eq. 7 based on the solution of Sparse PCA’s convex relaxation eq. 6. We describe the outlier removal below, and then give the proofs in Appendix E and Appendix F. We demonstrate the effectiveness of robust estimation for the filtering algorithm in Appendix H.

### 4.2 Outlier removal guarantees in Algorithm 3

In Algorithm 3, we denote samples in the input set as . This input set can be partitioned into two parts: , and . Lemma 4.1 shows that Algorithm 3 can return a guaranteed gradient estimate, or the outlier removal step eq. 7 is likely to discard an outlier. The guarantee on the outlier removal step eq. 7 hinges on measuring the projection scores – if is less than , we can show eq. 7 is likely to remove an outlier.

###### Lemma 4.1.

Suppose we observe -corrupted samples from Model 1.1 with . Let be an -corrupted set of gradient samples . Algorithm 3 computes that satisfies

 (9)

If , then with probability at least , we have

 ∑i∈Sgoodτi≤1γ∑i∈Sinτi, (10)

where is defined in line 10, is a constant depending on , and is a constant.

The proofs are collected in Appendix E. In a nutshell, eq. 9 is a natural convex relaxation for the sparsity constraint . The upper bound eq. 10 indicates that when , the contribution of is relatively small, which can be obtained through concentration inequalities for the samples in .

Based on Lemma 4.1, if , then the RHS of eq. 9 is bounded, leading to the error guarantee of . On the other hand, if , we can show that eq. 7 is more likely to throw out samples of rather than . Iteratively applying Algorithm 3 on the remaining samples, we can remove those outliers with large effect, and keep the remaining outliers’ effect well-controlled. This leads to the final bounds in Theorem 4.1.

## 5 Robust sparse regression with unknown covariance

In this section, we consider robust sparse regression where the covariance matrix is unknown, but equipped with additional sparsity structure. Formally, we define the sparse covariance matrices as follows:

###### Model 5.1 (Sparse covariance matrices).

In Model 1.1, the authentic covariates are drawn from . We assume that each row and column of is -sparse, but the positions of the non-zero entries are unknown.

Model 5.1

is widely studied in high dimensional statistics

[7, 19, 42]. Under Model 5.1, for the population gradient , where we use to denote the -sparse vector , we can guarantee the . Hence, we can use the filtering algorithm (Algorithm 3) with as a RSGE for robust sparse regression with unknown . When the covariance is unknown, we cannot evaluate a priori, thus the ellipsoid algorithm is not applicable to this case. And we provide error guarantees as follows.

###### Corollary 5.1.

Suppose we observe -corrupted samples from Model 1.1, where the covariates ’s follow from Model 5.1. If we use Algorithm 3 for robust sparse gradient estimation, it requires samples, and , then, we have

 ∥∥ˆβ−β∗∥∥2=˜O(σ√ϵ), (11)

with probability at least .

In the low dimensional robust regression with unknown covariance, [16] stated a computational lower bound that any polynomial time SQ algorithm with samples must incur an additive error . Based on the structured covariance Model 5.1 in high dimensional setting, we achieve this error bound in eq. 11, and the sample complexity is still sub-linear in .

We show the performance of robust estimation using our filtering algorithm with unknown covariance in Appendix H, and we observe same linear convergence as Section 4.

## References

• [1] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 2017 Conference on Learning Theory, 2017.
• [2] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
• [3] Boaz Barak and David Steurer. Proofs, beliefs, and algorithms through the lens of sum-of-squares. Course notes: http://www. sumofsquares. org/public/index. html, 2016.
• [4] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory, pages 1046–1066, 2013.
• [5] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
• [6] Kush Bhatia, Prateek Jain, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2107–2116, 2017.
• [7] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, 2008.
• [8] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
• [9] Sébastien Bubeck. Convex optimization: Algorithms and complexity.

Foundations and Trends® in Machine Learning

, 8(3-4):231–357, 2015.
• [10] Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under huber’s contamination model. Ann. Statist., 46(5):1932–1960, 10 2018.
• [11] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.
• [12] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
• [13] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008, 2017.
• [14] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
• [15] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 73–84. IEEE, 2017.
• [16] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2745–2754. SIAM, 2019.
• [17] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
• [18] Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui.

Optimal solutions for sparse principal component analysis.

Journal of Machine Learning Research, 9(Jul):1269–1294, 2008.
• [19] Noureddine El Karoui. Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics, 36(6):2717–2756, 2008.
• [20] Chao Gao. Robust regression via mutivariate regression depth. arXiv preprint arXiv:1702.04656, 2017.
• [21] Michael Grant, Stephen Boyd, and Yinyu Ye. Cvx: Matlab software for disciplined convex programming, 2008.
• [22] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
• [23] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer.

Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors.

In

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

, pages 178–191. ACM, 2016.
• [24] Peter J Huber. Robust estimation of a location parameter. The annals of mathematical statistics, pages 73–101, 1964.
• [25] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
• [26] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
• [27] David S Johnson and Franco P Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107, 1978.
• [28] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient Algorithms for Outlier-Robust Regression. arXiv preprint arXiv:1803.03241, 2018.
• [29] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(Dec):2715–2740, 2009.
• [30] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.
• [31] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99, 2013.
• [32] Haoyang Liu and Rina Foygel Barber. Between hard and soft thresholding: optimal iterative thresholding algorithms. arXiv preprint arXiv:1804.08841, 2018.
• [33] Liu Liu, Tianyang Li, and Constantine Caramanis. High dimensional robust estimation of sparse models via trimmed hard thresholding. arXiv preprint arXiv:1901.08237, 2019.
• [34] Nam H Nguyen and Trac D Tran. Exact recoverability from dense corrupted observations via l1-minimization. IEEE transactions on information theory, 59(4):2017–2035, 2013.
• [36] Peter J Rousseeuw and Annick M Leroy.

Robust regression and outlier detection

, volume 589.
John wiley & sons, 2005.
• [37] Jie Shen and Ping Li. A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650–7691, 2017.
• [38] Charles M Stein.

Estimation of the mean of a multivariate normal distribution.

The annals of Statistics, pages 1135–1151, 1981.
• [39] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
• [40] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• [41] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in neural information processing systems, pages 2670–2678, 2013.
• [42] Martin Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
• [43] Huan Xu, Constantine Caramanis, and Shie Mannor. Outlier-robust PCA: the high-dimensional case. IEEE transactions on information theory, 59(1):546–572, 2013.
• [44] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.

## Appendix A Proofs for the meta-theorem

In this section, we prove the global linear convergence guarantee given the Definition 2.1. In each iteration of Algorithm 1, we use to update

 βt+1=Pk′(βt−ηˆGt),

where is a fixed step size. Given the condition in RSGE’s definition, we show that Algorithm 1 linearly converges to a neighborhood around with error at most .

First, we introduce a supporting Lemma from [37], which bounds the distance between and in each iteration of Algorithm 1.

###### Lemma A.1 (Theorem 1 in [37]).

Let be an arbitrary vector and be any -sparse signal. For any , we have the following bound:

We choose the hard thresholding parameter , hence

###### Theorem A.1 (Theorem 2.1).

Suppose we observe -corrupted samples from Model 1.1. Algorithm 1, with -RSGE defined in Definition 2.1, with step size outputs , such that

 ∥∥ˆβ−β∗∥∥2=O(√ψ(ϵ)),

with probability at least , by setting and . The sample complexity is .

###### Proof.

By splitting samples into sets (each set has sample size ), Algorithm 1 collects a fresh batch of samples with size at each iteration . Definition 2.1 shows that for the fixed gradient expectation , the estimate for the gradient satisfies:

 ∥∥ˆGt−Gt∥∥22≤α(ϵ)∥∥Gt∥∥22+ψ(ϵ) (12)

with probability at least , where is determined by .

Letting , we study the -th iteration of Algorithm 1. Based on Lemma A.1, we have

 ∥∥βt+1−β∗∥∥2 ≤√ζ∥∥βt−ηˆG−β∗∥∥2 =√ζ∥∥βt−ηG−β∗+η(G−ˆG)∥∥2 (i)≤√ζ∥∥(Id−ηΣ)(βt−β∗)∥∥2+√ζη√α(ϵ)∥G∥22+ψ(ϵ)

where (i) follows from the theoretical guarantee of RSGE, and (ii) follows from the basic inequality for non-negative .

By setting , we have

 ∥∥βt+1−β∗∥∥2 ≤√ζ∥∥(Id−ηΣ)(βt−β∗)∥∥2+√ζη√α(ϵ)∥∥Σ(βt−β∗)∥∥2+√ζη√ψ(ϵ) ≤√ζ(1−1cκ+√α(ϵ))∥∥βt−β∗∥∥2+√ζη√ψ(ϵ) (13)

When is a small enough constant, we have , then

 √ζ(1−1cκ+√α(ϵ)) ≤√ζ(1−12cκ) ≤√1+ρ+√(4+ρ)ρ2(1−12cκ)

Plugging in the parameter in Lemma A.1, we have

 √ζ(1−1cκ+√α(ϵ))≤1−110cκ

Together with eq. 13, we have the recursion

 ∥∥βt+1−β∗∥∥2≤(1−110cκ)∥∥βt−β∗∥∥2+√ζη√ψ(ϵ).

By solving this recursion and using a union bound, we have

with probability at least .

By the definition of and , we have

## Appendix B Correcting Lemma A.3 in [1]’s proof

A key part of the proof of the main theorem in [1] is to obtain an upper bound on the -sparse operator norm. Specifically, their Lemmas A.2 and A.3 aim to show:

 λ∗≥∥∥ ∥∥|S|∑i=1wi(gi−ˆG(w))⊗2−F(ˆG(w))∥∥ ∥∥˜k,op≥∥∥P˜k(˜Δ(w))∥∥225ϵ, (14)

where , 555The are weights, and these are defined precisely in Section D, but are not required for the present discussion or counterexample., and recall is the solution to the SDP as given in Algorithm 3.

Lemma A.3 asserts the first inequality above, and Lemma A.2 the second. As we show below, Lemma A.3 cannot be correct. Specifically, the issue is that the quantity inside the second term in eq. 14 may not be positive semidefinite. In this case, the convex optimization problem whose solution is is not a valid relaxation, and hence the they obtain need not a valid upper bound. Indeed, we give a simple example below that illustrates precisely this potential issue.

Fortunately, not all is lost – indeed, as our results imply, the main results in [1] is correct. The key is to show that while does not upper bound the sparse operator norm, it does, however, upper bound the quantity

 max∥v∥2=1,∥v∥0≤˜kv⊤⎛⎝|S|∑i=1wi(gi−ˆG(w))⊗2−F(ˆG(w))⎞⎠v. (15)

We show this in Appendix D. More specifically, in Lemma D.3, we replace the -sparse operator norm in the second term of eq. 14 by the term in eq. 15. We show this can be used to complete the proof in Section D.4.

We now provide a counterexample that shows the first inequality in (14) cannot hold. The main argument is that the convex relaxation for sparse PCA is a valid upper bound of the sparse operator norm only for positive semidefinite matrices. Specifically, denoting