# High Dimensional Robust Estimation of Sparse Models via Trimmed Hard Thresholding

We study the problem of sparsity constrained M-estimation with arbitrary corruptions to both explanatory and response variables in the high-dimensional regime, where the number of variables d is larger than the sample size n. Our main contribution is a highly efficient gradient-based optimization algorithm that we call Trimmed Hard Thresholding -- a robust variant of Iterative Hard Thresholding (IHT) by using trimmed mean in gradient computations. Our algorithm can deal with a wide class of sparsity constrained M-estimation problems, and we can tolerate a nearly dimension independent fraction of arbitrarily corrupted samples. More specifically, when the corrupted fraction satisfies ϵ≲1 /(√(k) (nd)), where k is the sparsity of the parameter, we obtain accurate estimation and model selection guarantees with optimal sample complexity. Furthermore, we extend our algorithm to sparse Gaussian graphical model (precision matrix) estimation via a neighborhood selection approach. We demonstrate the effectiveness of robust estimation in sparse linear, logistic regression, and sparse precision matrix estimation on synthetic and real-world US equities data.

## Authors

• 53 publications
• 6 publications
• 43 publications
• ### High Dimensional Robust Sparse Regression

We provide a novel -- and to the best of our knowledge, the first -- alg...
05/29/2018 ∙ by Liu Liu, et al. ∙ 4

• ### Gradient Hard Thresholding Pursuit for Sparsity-Constrained Optimization

Hard Thresholding Pursuit (HTP) is an iterative greedy selection procedu...
11/22/2013 ∙ by Xiao-Tong Yuan, et al. ∙ 0

• ### Robust Structured Statistical Estimation via Conditional Gradient Type Methods

Structured statistical estimation problems are often solved by Condition...
07/07/2020 ∙ by Jiacheng Zhuo, et al. ∙ 5

• ### Robust Regression via Hard Thresholding

We study the problem of Robust Least Squares Regression (RLSR) where sev...
06/08/2015 ∙ by Kush Bhatia, et al. ∙ 0

• ### High-dimensional structure estimation in Ising models: Local separation criterion

We consider the problem of high-dimensional Ising (graphical) model sele...
07/08/2011 ∙ by Vincent Y. F. Tan, et al. ∙ 0

• ### Efficient and Consistent Robust Time Series Analysis

We study the problem of robust time series analysis under the standard a...
07/01/2016 ∙ by Kush Bhatia, et al. ∙ 0

• ### Robust High Dimensional Expectation Maximization Algorithm via Trimmed Hard Thresholding

In this paper, we study the problem of estimating latent variable models...
10/19/2020 ∙ by Di Wang, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Statistical estimation with heavy tailed outliers or even arbitrary corruptions has long been a focus in robust statistics

[Box53, Tuk75, Hub11, HRRS11]. In many practical applications, such as gene expression analysis [LW01] and financial forecasting [dP18], it is important to control the impact of outliers in the dataset in order to obtain meaningful results.

To introduce our robust -estimation procedure, we first recap -estimation in the setting with no outliers. -estimation is a standard technique for statistical estimation [vdV00]

, such as linear regression, logistic regression, and more generally maximum likelihood estimation. Moreover, empirical risk minimization (ERM)

[Vap13, DGL13] is the optimal algorithm for many learning problems. These have been extended successfully to the high dimensional setting with sparsity (or other low-dimensional structure), e.g., using Lasso [Tib96, BvdG11, HTW15, Wai19].

Huber’s seminal work [Hub64] proposed -estimation methods for robust regression, where a fraction of response variables are outliers. For these types of robust -estimators, they replace the classical least squared risk minimization objective with a robust counterpart (e.g., Huber loss). However, when there are outliers in the explanatory variables (covariates), there is little advantage over the square loss.

Sparse estimation is also important for estimating Gaussian graphical model structures in the high dimensional setting. It is well known that the conditional independence relationships of Gaussian random variables correspond to the sparsity pattern of its precision (inverse covariance) matrix

[KFB09, WJ08]. A well known work [MB06] developed a sparse-regression based neighborhood selection approach for recovering this sparsity pattern by regressing each variable against its neighbors using Lasso. As we discuss below, a key aspect of our contributions is that (unlike previous high dimensional robustness results) our approach is flexible enough to apply to robust sparse graphical model estimation.

In this paper, we study the problem of sparsity constrained -estimation with arbitrary corruptions to both explanatory and response variables in the high-dimensional regime (). To model the corruptions, we assume that the adversary replaces an arbitrary -fraction of the authentic samples with arbitrary values (creftypecap 2.1). This is stronger than Huber’s contamination model [Hub64]

, which only allows the adversary to independently corrupt each sample with some probability.

### 1.1 Related works

#### Robust sparse regression.

Earlier works in robustness of sparse regression problems consider heavy tailed distributions or arbitrary corruptions only in the response variables [Li13, BJK15, BJK17, Loh17, KP18]. Yet these algorithms cannot be trivially extended to the setting with corrupted explanatory variables, which is the main focus of this work.

Another line of recent research [ACG13, VMX17, YLA18, SS18] focus on alternating minimization approaches which extend Least Trimmed Squares [Rou84]. However, these methods only have local convergence guarantees, and cannot handle arbitrary corruptions.

[CCM13] was one of the first papers to provide guarantees for sparse regression with arbitrary outliers in both response and explanatory variables. Those results are specific to sparse regression, however, and cannot be trivially extended to general -estimation problems. Moreover, even for linear regression with covariance matrix , the statistical rates are not minimax optimal. [Gao17] optimizes Tukey depth [Tuk75, CGR18] for robust sparse regression under the Huber -contamination model, and their algorithm is minimax optimal and can handle a constant fraction of outliers (). However, it is intractable to compute the Tukey depth [JP78]. Recent results [BDLS17, LSLC18] leverage robust sparse mean estimation in robust sparse regression. Their algorithms are computationally tractable, and can tolerate , but they require restrictive assumptions on the covariance matrix (), which, in particular, precludes their use in applications such as graphical model estimation.

#### Robust M-estimation via robust gradient descent.

Works in [CSX17, HI17] and later [YCRB18a] first leveraged the idea of using robust mean estimation in each step of gradient descent, using a subroutine such as geometric median, and applied this to defend against Byzantine gradient attacks. A similar approach using more sophisticated robust mean estimation methods was later proposed in [PSBR18, DKK18, YCRB18b, SX18] for robust gradient descent. These methods all focused on low dimensional robust -estimation. Work in [LSLC18] extended the approach to the high-dimensional setting (though is limited to ).

Even though the corrupted fraction can be independent of the ambient dimension by using sophisticated robust mean estimation algorithms [DKK16, LRV16, SCV17], or the sum-of-squares framework [KKM18], these algorithms (except [LSLC18]) are not applicable to the high dimensional setting (), as they require at least samples.

#### Robust estimation of graphical models.

A line of research using a robustified covariance matrix to substitute for the sample covariance matrix in Gaussian graphical models [LHY12b, WG17, LT18] leverages GLasso [FHT08] or CLIME [CLL11] to estimate the sparse precision matrix. These robust methods are restricted to Gaussian graphical model estimation, and their techniques cannot be generalized to other -estimation problems.

### 1.2 Main contributions

• We propose Trimmed Hard Thresholding (Algorithm 1) for sparsity constrained -estimation with arbitrary corruptions in both explanatory and response variables. This algorithm has global linear convergence rate, and is computationally efficient – we incur little over head compared with vanilla gradient descent. This is in particular much faster than algorithms relying on sparse PCA relaxations as subroutines ([BDLS17, LSLC18]).

• With corruptions in both response and explanatory variables, we provide theoretical guarantees on statistical estimation in sparse linear and logistic regression. Our algorithm has minimax-optimal statistical error, and tolerates -fraction of outliers. This fraction is nearly independent of the , which is important in the high dimension regime.

• We extend Trimmed Hard Thresholding to neighborhood selection [MB06] for estimating Gaussian graphical models. We provide sparsity recovery guarantees for model selection in sparse precision matrix estimation, and this result shares similar robustness guarantees with sparse linear regression.

• We demonstrate the effectiveness of Trimmed Hard Thresholding on both synthetic data and (unmodified) real data. We also empirically show the performance of our algorithm under model misspecification.

#### Notations.

We denote the Hard Thresholding operator of sparsity by , and denote the Euclidean projection onto the ball by . We use

to denote the expectation operator obtained by the uniform distribution over all samples

.

## 2 Problem formulation

We now formally define the corruption model and the sparsity constrained -estimation. We first introduce the corruption model described above:

###### Definition 2.1 (ϵ-corrupted samples).

Let be i.i.d. observations with distribution . We say that a collection of samples is -corrupted if an adversary chooses an arbitrary -fraction of the samples in and modifies them with arbitrary values.

This corruption model allows corruptions in both explanatory and response variables in regression problems where we observe . creftypecap 2.1 also allows the adversary to select an -fraction of samples to delete and corrupt, hence it is stronger than Huber’s -contamination model [Hub64], where the corrupted samples are only added independently with probability .

Next, we introduce a classical sparsity constrained -estimation which minimizes the empirical risk only on i.i.d. observations in the set . Suppose that we are interested in estimating some parameter of . Let

be a convex and differentiable loss function. Our target is the unknown population minimizer

. Note that ’s definition allows model misspecification. And we solve a sparsity constrained ERM to estimate

 minβ∈Bf(β)=Ei∈uGℓi(β;zi),s.t.∥β∥0≤k, (1)

where, is defined as the empirical average on , and is the ball including .

For example, consider linear regression on samples , with the squared loss function eq. 2a. For logistic regression, we require and use the logistic loss eq. 2b 111While our results focus on the squared and logistic loss, our experiments indicate that our approach also works for other loss functions in high dimensional robust -estimation. We also demonstrate experimental results for other -estimators (e.g., Huber loss [Hub64, Loh17]).. equationparentequation

 (yi−x⊤iβ)2, (2a) log(1+exp(−yix⊤iβ)). (2b)

To solve eq. 1, under restricted strong convexity (RSC) and restricted smoothness (RSM) conditions (Section 4), a well known result [NRWY12] shows statistical results based on a convex relaxation from to . From an optimization viewpoint, existing results reveal that gradient descent algorithms equipped with soft-thresholding [ANW12] or hard-thresholding [BD09, JTK14, SL17, YLZ18, LB18] have linear convergence rate, and achieve known minimax lower bounds in statistical estimation [RWY11, ZWJ14].

## 3 Robust sparse estimation via Trimmed Hard Thresholding

Given -corrupted set (creftypecap 2.1), directly running ERM on the entire input dataset with corruptions

 minβ∈BEi∈uSℓi(β;zi),s.t.∥β∥0≤k, (3)

can be arbitrarily corrupted. In this case, we are interested in estimating the unknown parameter from -corrupted samples . Classical robust statistics seeks finding a robust counterpart to the loss function in eq. 3, yet this approach cannot deal with the contamination model (creftypecap 2.1), where can both be arbitrarily corrupted. To address this challenge, recent works use robust mean estimation in each step of gradient step to make eq. 3 robust, even with an -fraction of being arbitrary outliers.

Here, we use a robust mean estimator in each step of gradient descent. We robustify gradient descent by a robust mean estimation over each sample’s gradient . Although different robust mean estimators can be used, we use to denote the robust gradient estimate at (shorthanded as without abuse of notation). We write gradient as the gradient on the good samples , and as the population gradient

 G =Ei∈uG∇ℓi(β;zi), (4) Gpop =Ezi∼P∇ℓi(β;zi). (5)

To guarantee that the final output of robust gradient descent can recover with , one way is to use a robust mean estimator on gradients in each iteration to estimate the population gradient with , where positive quantities and are independent of gradient descent iterations. Note that different robust mean estimators have different and for different statistical models, and they are functions of .

In the high dimensional setting, we would require a high dimensional robust mean estimator in each gradient descent step. Previous methods such as [CSX17, HI17, PSBR18, DKK18, YCRB18a, YCRB18b, SX18] cannot be used because they all require to bound . A recent work [LSLC18] on robust sparse linear regression uses a robust sparse mean estimator [BDLS17] to guarantee with sample complexity . However, their algorithm can only deal with sparse linear regression under the restricted assumption , and cannot be extended to our general -estimation problems. Our algorithm provides a robust gradient estimate during each iteration in IHT, thus can be used for general sparse -estimation problems.

### 3.2 Trimmed Hard Thresholding

In our algorithm, we use a dimensional trimmed gradient estimate for – the gradient defined on the original authentic data set (eq. 4). We show that hard thresholding combined with dimensional trimmed gradient estimator with bound (instead of ) is the key ingredient for the robustness in sparsity constrained statistical estimation.

Let us define the dimensional trimmed gradient estimator:

###### Definition 3.1 (Dimensional α-trimmed gradient estimator).

Given a set of -corrupted samples of stochastic gradients , for each dimension , the dimensional -trimmed gradient estimator removes the largest and smallest fraction of elements in , and calculates the mean of the remaining terms. We choose for some constant , and require for a small constant .

We show in creftypecap 3.1 below, that this dimensional trimmed gradient estimator can guarantee with high probability, in a gradient type optimization algorithm. However, a naive application of trimmed gradient estimator only gives the bound , where perturbation due to outliers scales with the ambient dimension . This gives a sub-optimal (in fact trivial) result for robust gradient descent with trimmed gradients, namely, the fraction of corruptions it can tolerate is less than , which tends to in the high dimensional regime.

To address this issue, we propose Trimmed Hard Thresholding (Algorithm 1), which uses hard thresholding after each trimmed gradient update222Our theory requires splitting samples across different iterations to maintain independence between iterations. We believe this is an artifact of the analysis, and do not use this our experiments. [BWY17, PSBR18] use a similar approach for theoretical analysis. . In line 8, we use the trimmed gradient estimator (creftypecap 3.1) to obtain the robust gradient estimate . In line 9, we update the parameter by taking a hard thresholding . The hard thresholding operator , can be implemented by selecting the largest elements in magnitude. Here the hyper-parameter is proportional to (eq. 7).

A key observation in line 9 is that, in each step of IHT, the iterate is sparse, and this enforces the perturbation from outliers to only depend on IHT’s sparsity instead of the ambient dimension . Based on a careful analysis of hard thresholding operator in each iteration, we show that the perturbation due to outliers in IHT is at most

 √k′+k∥∥ˆG−G∥∥∞. (6)

We prove that the perturbation eq. 6 is nearly independent of dimension (eq. 8). Hence the contamination fraction we can tolerate is nearly independent of the ambient dimension. This follows from trimmed gradient estimator guarantees on (creftypecap 3.1 and 3.2).

In this section, we show gradient estimation guarantees using the trimmed gradient estimator (creftypecap 3.1) in linear and logistic regression.

###### Model 3.1 (Sparse linear regression).

Under the contamination model creftypecap 2.1, authentic samples are drawn from a standard sub-Gaussian design linear model : , with being -sparse. We assume that

’s are i.i.d. samples from a sub-Gaussian distribution with normalized covariance matrix

, where for all , and the stochastic sub-Gaussian noise has mean

and variance

.

###### Model 3.2 (Sparse logistic regression).

Under the contamination model creftypecap 2.1, authentic samples are drawn from a binary classification model , where the binary label

follows the conditional probability distribution

, with being -sparse. We assume that ’s are i.i.d. samples from a sub-Gaussian distribution with normalized covariance matrix , where for all .

In both linear and logistic regression, we provide detailed analysis of . Compared to previous analysis of trimmed mean estimation [CCM13, YCRB18a], our results hold even if an adversary deletes an -fraction of samples from as defined in creftypecap 2.1.

###### Proposition 3.1.

Suppose that we observe -corrupted (creftypecap 2.1) samples from sparse linear regression model (creftypecap 3.1). The dimensional -trimmed gradient estimator with can guarantees that

 ∥∥ˆG−G∥∥∞=O(√∥Δ∥22+σ2(ϵlog(nd)+√logdn)),

with probability at least , where , and is a universal constant.

###### Proposition 3.2.

Suppose that we observe -corrupted (creftypecap 2.1) samples from the sparse logistic regression model (creftypecap 3.2). The dimensional -trimmed gradient estimator with guarantees that

 ∥∥ˆG−G∥∥∞=O(ϵlog(nd)+√logdn),

with probability at least , where is a universal constant.

The proofs are given in Appendix A, where we prove guarantees on recovering the true gradient from -corrupted sub-exponential gradient samples using the trimmed mean estimator.333The sub-exponential assumption is necessary. In sparse linear regression, the gradient’s distribution is indeed sub-exponential.

Model misspecification. Our algorithm is not limited to true linear models, although results are established under linear assumptions. In Section 6, we show that, in a nonlinear model, results of Algorithm 1 on -corrupted input is comparable to sparse linear regression on uncorrupted input.

## 4 Global linear convergence and statistical guarantees

In this section, we provide statistical guarantees under the models in Section 3.3. Using creftypecap 3.1 and creftypecap 3.2, we prove Algorithm 1’s global convergence, and study its statistical guarantees. Recall that is the empirical risk defined on i.i.d. samples , our analysis depends on RSC/RSM conditions of the curvature of [NRWY12].

###### Definition 4.1 (Rsc/rsm).

A differentiable function satisfies -restricted strong convexity (RSC) with parameter , and -restricted smoothness (RSM) with parameter at sparsity level , if

 f(β1) ≥f(β0)+⟨G(β0),β1−β0⟩+μα2∥β1−β0∥22, f(β1)

hold respectively, for all with and . We define the condition number . And we set the hyper-parameter as

 k′=4ρ2k. (7)

### 4.1 Theoretical Results

For sparse linear and logistic regression, we present global convergence and statistical guarantees in creftypecap 4.1 and creftypecap 4.2, respectively. The proofs are collected in Appendix B.

#### Sparse linear regression.

By [RWY10, ANW12], when , we RSC/RSM condition is satisfied with , and .

###### Theorem 4.1.

Suppose that we observe -corrupted samples (creftypecap 2.1) from the sparse linear regression model (creftypecap 3.1). Under the condition , and , Algorithm 1 with outputs satisfying

with probability at least , where we set .

Time complexity. Algorithm 1 has a global linear convergence rate. In each iteration, we only use operations complexity to calculate trimmed mean. We incur logarithmic overhead compared to normal gradient descent [Bub15].

Statistical accuracy and robustness. Compared with [CCM13, BDLS17], our statistical error rate is minimax optimal [RWY11, ZWJ14], and has no dependencies on . Furthermore, the upper bound on is nearly independent of the ambient dimension , which guarantees Algorithm 1’s robustness in high dimensions.

#### Sparse logistic regression.

Because in Algorithm 1, we project each iterate onto a ball independent of and (line 10), we have strictly positive . It is well known that, when for some problem dependent constant , the RSC and RSM parameters are strictly positive constants depending on the underlying distribution [NRWY12, LPR15].

###### Theorem 4.2.

Suppose that we observe -corrupted samples (creftypecap 2.1 ) the sparse logistic regression model (creftypecap 3.2). Algorithm 1 with outputs , such that

with probability at least , when we set .

Statistical accuracy and robustness. Under the sparse Gaussian linear discriminant analysis (LDA) model, which is a special case of creftypecap 3.2, Algorithm 1 achieves the statistical minimax rate [LPR15, LYCR17].

### 4.2 Proof outline

Here, we give a proof outline for sparse linear regression (creftypecap 4.1), and the technique for creftypecap 4.2 is similar. The detailed proofs are given in Appendix B.

Since and are enforced to be -sparse, and (the empirical risk defined on ) satisfies -RSC and -RSM. Combining the two inequalities, we obtain

 f(\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt)−f(β∗)≤⟨G(βt−1),\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt−β∗⟩−μα2∥∥β∗−βt−1∥∥22+μL2∥∥\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt−βt−1∥∥22(♢).

Expanding the term , we have

 (♢) =μL2∥∥βt−1−β∗∥∥22−μL2∥∥\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt−βt−1∥∥22 −⟨ˆG(βt−1),\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt−β∗⟩+μL⟨(βt−1−ηˆG(βt−1))−\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt,β∗−\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt⟩

Recall that is directly obtained from hard thresholding, and because , by applying Lemma 1 of [LB18] in the last term (more details in Appendix B), we have

Putting together the pieces, and using , , we obtain the key inequality

 f(\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111βt)−f(β∗) ≤μL2(ρ−1ρ∥∥Δt−1∥∥22−2ρ−12ρ∥∥\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111Δt∥∥22)+∣∣⟨G(βt−1)−ˆG(βt−1),\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111Δt⟩∣∣(♣).

Next, we use the crucial fact that the perturbation due to outliers (the inner product term ) is nearly independent of the ambient dimension , because hard thresholding enforces to be sparse, resulting in the bound eq. 6. And this implies

 (8)

Applying creftypecap 3.1, and using convexity to obtain a lower bound on , we can solve a quadratic inequality to obtain the following recursion

 ∥∥\macc@depth\char1\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111Δt∥∥2≤(1−18ρ)∥∥Δt−1∥∥2+4ΓσμL,

which holds with high probability. Here . Since Euclidean projection guarantees [Bub15], we have established the global linear convergence rate for Algorithm 1. Plugging in , we can obtain the following statistical guarantee

 ∥∥ˆβ−β∗∥∥2=O(ρ2σ(ϵ√klog(nd)+√klogdn)).

## 5 Sparsity recovery and its application in Gaussian graphical model estimation

In this section, we demontrate the sparsity recovery performance of Algorithm 1, which is of great importance in sparse -estimation and graphical model learning [MB06, Wai09, RWL10, RWRY11, BvdG11, HTW15].

### 5.1 Sparsity recovery guarantees

We use to denote top indexes of with the largest magnitude. Let denote the smallest absolute value of nonzero element of . To control the false negative rate, creftypecap 5.1 shows that under the -condition, is exactly . The proofs are given in Appendix C. Sparsity recovery guarantee for sparse logistic regression is similar, and is omitted due to space constraint.

###### Theorem 5.1.

Under the same condition as in creftypecap 4.1, and a -condition

 β∗min=Ω(ρ2σ(ϵ√klog(nd)+√klogdn)), (9)

Algorithm 1 guarantees that , with probability at least .

Existing results on sparsity recovery for regularized estimators [Wai09, LSRC15] do not require the RSC condition creftypecap 4.1, but instead require an irrepresentability condition, which is stronger. If , eq. 9 has the same -condition as IHT for sparsity recovery [YLZ18].

### 5.2 Robust Gaussian graphical model selection

Here, we consider sparse precision matrix estimation for Gaussian graphical models. It is well known that the sparsity pattern of its precision matrix matches the conditional independence relationships of Gaussian random variables [KFB09, WJ08].

###### Model 5.1 (Sparse precision matrix estimation).

Under the contamination model creftypecap 2.1, authentic samples are drawn from a multivariate Gaussian distribution . We assume that each row of the precision matrix (excluding diagonal entry) is -sparse – each node has at most edges.

For the uncorrupted samples drawn from the Gaussian graphical model, the neighborhood selection (NS) algorithm [MB06] solves a convex relaxation of the following sparsity constrained optimization to regress each variable against its neighbors

 ˆβj= argminβ∈Rd−11mm∑i=1(xij−x⊤i(j)β)2, (10) s.t. ∥β∥0≤k,for each j∈[d],

where denotes the -th coordinate of , and denotes the index set (with removed). Let denote ’s -th column with the diagonal entry removed. and denote the -th diagonal element of . Then, the sparsity pattern of can be estimated through . Details on the connection between and are given in Appendix C.

However, given -corrupted samples from the Gaussian graphical model, this procedure will fail [LHY12b, WG17]. To address this issue, we propose Robust NS (Algorithm 2 in Appendix C), which robustifies Neighborhood Selection [MB06] by using Trimmed Hard Thresholding (with loss function eq. 2a) to robustify eq. 10. Similar to creftypecap 5.1, a -condition guarantees consistent edge selection.

###### Corollary 5.1.

Under the same condition as in creftypecap 4.1, and a -condition for ,

 θ(j),min=Ω(Θ1/2j,jρ2(ϵ√klog(nd)+√klogdn)),

Robust NS (Algorithm 2) achieves consistent edge selection, with probability at least .

Similar to creftypecap 4.1, the fraction is nearly independent of dimension , which provides guarantees of Robust NS in high dimensions. Other Gaussian graphical model selection algorithms include GLasso [FHT08], CLIME[CLL11]. Experiments comparing robustified versions of these algorithms are given in Section 6.

## 6 Experiments

We demonstrate empirical performance of Trimmed Hard Thresholding (Algorithm 1 and Algorithm 2), and the complete experiment setup is in Appendix D.

### 6.1 Synthetic data – sparse linear models.

#### Sparse linear regression.

In the first experiment, we generate samples from an exact sparse linear regression model (creftypecap 3.1) with a Toeplitz covariance . Here, the stochastic noise , and we vary the noise level in different simulations 444 Beyond sub-Gaussian error distributions, our algorithm naturally extends to other robust loss functions (e.g., Huber loss). In Appendix D, we study the empirical performance of Algorithm 1

with Huber loss where the stochastic noise follows a heavy-tailed Cauchy distribution.

.

We fix , and track the parameter error in each iteration. Left plot of Figure (a)a shows Algorithm 1’s linear convergence, and the error curves flatten out at the final error level. Furthermore, Algorithm 1 can achieve machine precision when , which means exactly recovering of .

#### Misspecified model.

Here we consider the best sparse linear approximation under model misspecification. We use the same Toeplitz covariates and true parameter as before, but . Although this is a non-linear function, the best sparse linear approximation can select relevant features because is sparse.

For simplicity, we track the function evaluated on all authentic samples . In the right plot of Figure (a)a, we compare the performance of Algorithm 1 under different against the oracle curve (IHT only on authentic samples). The right plot has similar convergence under different , and shows the robustness of Algorithm 1 without assuming an underlying linear model.

#### Sparse logistic regression.

For binary classification problem, we generate data using a sparse LDA model. We run Algorithm 1 with logistic loss under different levels of corruption. In the left plot of Figure (a)a, we observe similar linear convergence as sparse linear regression problem, and this is consistent with creftypecap 4.2.

We then compare Algorithm 1 with Trimmed Lasso estimator [YLA18], which uses alternating minimization for sparse logistic regression. Under different dimensions , the right plot of Figure (a)a shows classification error evaluated on authentic test set for , which demonstrates that Trimmed Hard Thresholding is better than Trimmed Lasso.

### 6.2 Synthetic data – Gaussian graphical model.

We draw samples from a Gaussian graphical model with a sparse precision matrix using huge [ZLR12], and add corruptions. We compare Algorithm 2 with other robust graphical model estimators, which include Trimmed GLasso [YLA18], RCLIME [WG17], Skeptic [LHY12b], and Spearman [LT18]. We fix , and vary . We show results for different off-diagonal values (Low SNR) and (High SNR), with High SNR plots in Appendix D. Figure (a)a shows model selection performance measured by receiver operating characteristic (ROC) curves. For the whole regularization path, our algorithm (denoted as Robust NS) has a better ROC compared to other algorithms.

In particular, Robust NS outperforms other methods with higher true positive rate when the false positive rate is small. This validates our theory in creftypecap 5.1, guaranteeing sparsity recovery when hard thresholding hyper-parameter is suitably chosen to match ’s sparsity .

### 6.3 Real data experiments.

To demonstrate the efficacy of Algorithm 2, we apply it to a US equities dataset [LHY12a, ZLR12], which is heavy-tailed and has many outliers [dP18]. The dataset contains 1,257 daily closing prices of 452 stocks (variables).

It is well known that stocks from the same sector tend to be clustered together [Kin66]. Therefore, we use Robust NS (Algorithm 2) to construct an undirected graph among stocks. Graphs estimated by different algorithms are shown in Figure 12. We can see that stocks from the same sector are clustered together, and these clustering centers can be easily identified. We also compare Algorithm 2 to the baseline NS approach (without no consideration for corruptions or outliers). We can observe that stocks from Information Technology (colored by purple) are much better clustered by Algorithm 2.

## References

• [ACG13] Andreas Alfons, Christophe Croux, and Sarah Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, pages 226–248, 2013.
• [ANW12] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright.

Fast global convergence of gradient methods for high-dimensional statistical recovery.

Ann. Statist., 40(5):2452–2482, 10 2012.
• [BD09] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
• [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 2017 Conference on Learning Theory, 2017.
• [BJK15] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
• [BJK17] Kush Bhatia, Prateek Jain, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2107–2116, 2017.
• [Box53] George EP Box. Non-normality and tests on variances. Biometrika, 40(3/4):318–335, 1953.
• [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity.

Foundations and Trends® in Machine Learning

, 8(3-4):231–357, 2015.
• [BvdG11] Peter Bühlmann and Sara van de Geer.

Statistics for high-dimensional data: methods, theory and applications

.
Springer Science & Business Media, 2011.
• [BWY17] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
• [CCM13] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.
• [CGR18] Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under huber’s contamination model. Ann. Statist., 46(5):1932–1960, 10 2018.
• [CLL11] Tony Cai, Weidong Liu, and Xi Luo. A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
• [CSX17] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
• [DGL13] Luc Devroye, László Györfi, and Gábor Lugosi.

A probabilistic theory of pattern recognition

, volume 31.
Springer Science & Business Media, 2013.
• [DKK16] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
• [DKK18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
• [dP18] M. L. de Prado. Advances in Financial Machine Learning. Wiley, 2018.
• [FHT08] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
• [Gao17] Chao Gao. Robust regression via mutivariate regression depth. arXiv preprint arXiv:1702.04656, 2017.
• [HI17] Matthew J Holland and Kazushi Ikeda. Efficient learning with robust gradient descent. arXiv preprint arXiv:1706.00182, 2017.
• [HRRS11] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
• [HTW15] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015.
• [Hub64] Peter J Huber. Robust estimation of a location parameter. The annals of mathematical statistics, pages 73–101, 1964.
• [Hub11] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
• [JP78] David Johnson and Franco Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107, 1978.
• [JTK14] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
• [KFB09] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
• [Kin66] Benjamin King. Market and industry factors in stock price behavior. the Journal of Business, 39(1):139–190, 1966.
• [KKM18] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient Algorithms for Outlier-Robust Regression. arXiv preprint arXiv:1803.03241, 2018.
• [KP18] Sushrut Karmalkar and Eric Price. Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055, 2018.
• [LB18] Haoyang Liu and Rina Foygel Barber. Between hard and soft thresholding: optimal iterative thresholding algorithms. arXiv preprint arXiv:1804.08841, 2018.
• [LHY12a] Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. The nonparanormal skeptic. arXiv preprint arXiv:1206.6488, 2012.
• [LHY12b] Han Liu, Fang Han, Ming Yuan, John Lafferty, Larry Wasserman, et al. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.
• [Li13] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99, 2013.
• [Loh17] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust -estimators. The Annals of Statistics, 45(2):866–896, 2017.
• [LPR15] Tianyang Li, Adarsh Prasad, and Pradeep K Ravikumar. Fast classification rates for high-dimensional gaussian generative models. In Advances in Neural Information Processing Systems, pages 1054–1062, 2015.
• [LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.
• [LSLC18] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.
• [LSRC15] Yen-Huan Li, Jonathan Scarlett, Pradeep Ravikumar, and Volkan Cevher. Sparsistency of 1-regularized m-estimators. In AISTATS, 2015.
• [LT18] Po-Ling Loh and Xin Lu Tan. High-dimensional robust precision matrix estimation: Cellwise corruption under -contamination. Electronic Journal of Statistics, 12(1):1429–1467, 2018.
• [LW01] Cheng Li and Wing Hung Wong.

Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection.

Proceedings of the National Academy of Sciences, 98(1):31–36, 2001.
• [LYCR17] Tianyang Li, Xinyang Yi, Constantine Caramanis, and Pradeep Ravikumar. Minimax gaussian classification & clustering. In Artificial Intelligence and Statistics, pages 1–9, 2017.
• [MB06] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, 34(3):1436–1462, 2006.
• [NRWY12] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of -estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
• [Rou84] Peter J Rousseeuw. Least median of squares regression. Journal of the American statistical association, 79(388):871–880, 1984.
• [RWL10] Pradeep Ravikumar, Martin J Wainwright, and John D Lafferty. High-dimensional ising model selection using -regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
• [RWRY11] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing -penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
• [RWY10] Garvesh Raskutti, Martin J Wainwright, and Bin Yu.

Restricted eigenvalue properties for correlated gaussian designs.

Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
• [RWY11] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over -balls. IEEE transactions on information theory, 57(10):6976–6994, 2011.
• [SCV17] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
• [SL17] Jie Shen and Ping Li. A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650–7691, 2017.
• [SS18] Yanyao Shen and Sujay Sanghavi. Iteratively learning from the best. arXiv preprint arXiv:1810.11874, 2018.
• [SX18] Lili Su and Jiaming Xu. Securing distributed machine learning in high dimensions. arXiv preprint arXiv:1804.10140, 2018.
• [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
• [Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.

The nature of statistical learning theory

.
Springer Science & Business media, 2013.
• [vdV00] Aad van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.
• [VMX17] Daniel Vainsencher, Shie Mannor, and Huan Xu. Ignoring is a bliss: Learning with large noise through reweighting-minimization. In Conference on Learning Theory, pages 1849–1881, 2017.
• [Wai09] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
• [Wai19] Martin Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
• [WG17] Lingxiao Wang and Quanquan Gu. Robust gaussian graphical model estimation with arbitrary corruption. In International Conference on Machine Learning, pages 3617–3626, 2017.
• [WJ08] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
• [YCRB18a] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650–5659. PMLR, 10–15 Jul 2018.
• [YCRB18b] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Defending against saddle point attack in byzantine-robust distributed learning. arXiv preprint arXiv:1806.05358, 2018.
• [YL15] Eunho Yang and Aurélie C Lozano. Robust gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems, pages 2602–2610, 2015.
• [YLA18] Eunho Yang, Aurélie C Lozano, and Aleksandr Aravkin. A general family of trimmed estimators for robust high-dimensional data analysis. Electronic Journal of Statistics, 12(2):3519–3553, 2018.
• [YLZ18] Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research, 18(166):1–43, 2018.
• [ZLR12] Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. The huge package for high-dimensional undirected graph estimation in r. Journal of Machine Learning Research, 13(Apr):1059–1062, 2012.
• [ZWJ14] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.

## Appendix A Proofs for the trimmed gradient estimator

Let us first visit the definition and tail bounds of sub-exponential random variable, as it will be used in sparse linear regression, where the gradient’s distribution is indeed sub-exponential.

We first present standard concentration inequalities ([Wai19]).

###### Definition A.1 (Sub-exponential random variables).

A random variable with mean is sub-exponential if there are non-negative parameters such that

 E[exp(t(X−μ))]≤exp(ν2t22),for all |t|<1ν.
###### Lemma A.1 (Bernstein’s inequality).

Suppose that , are i.i.d. sub-exponential random variables with parameters . Then equationparentequation

 exp(−nt22ν2)if 0≤t≤ν, and exp(−nt2ν) for t>ν.

We also have a two-sided tail bound

Similar to creftypecap 3.1, we define -trimmed mean estimator for one dimensional samples, and denote it as .

###### Definition A.2 (α-trimmed mean estimator).

Given a set of -corrupted samples , the dimensional trimmed mean estimator removes the largest and smallest fraction of elements in , and calculate the mean of the remaining terms. We choose , for a constant . We also require that , for some small constant .

In Trimmed Hard Thresholding (Algorithm 1), we first use trimmed mean estimator for each coordinate of gradients. creftypecap A.2 shows the guarantees for this robust gradient estimator in each coordinate. We note that creftypecap A.2 is stronger than guarantees for trimmed mean estimator (Lemma 3) in [YCRB18a].

In our contamination model creftypecap 2.1, the adversary may delete -fraction of authentic samples, and then add arbitrary outliers. And creftypecap A.2 provides guarantees for trimmed mean estimator on sub-exponential random variables. The trimmed mean estimator is robust enough, that it allows the adversary to arbitrarily remove -fraction of data points. We use to denote the samples at the -th coordinate of . We can also define in the same way.

###### Lemma A.2.

Suppose we observe -corrupted samples from creftypecap 2.1. For each dimension , we assume the samples in are i.i.d. -sub-exponential with mean . After the contamination, we have the -th samples as . Then, we can guarantee the trimmed mean estimator on -th dimension that

 ∣∣trmeanα{xi:i∈Sj}−μj∣∣=O(ν(ϵlog(nd)+√logdn))

with probability at least .

We leave the proof of creftypecap A.2 at the end of this section. Then, we present separate analysis of trimmed gradient estimator for sparse linear regression and sparse logistic regression by using creftypecap A.2.

### a.1 Sparse linear regression

In this subsection, we use creftypecap A.2 to bound . We will show that, in the sparse linear regression model with sub-Gaussian covariates, the distribution of authentic gradients are sub-exponential instead of sub-Gaussian.

In creftypecap A.1, we first prove that when the current parameter iterate is , the sub-exponential parameter of all authentic gradient is , where