A General Family of Trimmed Estimators for Robust High-dimensional Data Analysis

We consider the problem of robustifying high-dimensional structured estimation. Robust techniques are key in real-world applications which often involve outliers and data corruption. We focus on trimmed versions of structurally regularized M-estimators in the high-dimensional setting, including the popular Least Trimmed Squares estimator, as well as analogous estimators for generalized linear models and graphical models, using possibly non-convex loss functions. We present a general analysis of their statistical convergence rates and consistency, and then take a closer look at the trimmed versions of the Lasso and Graphical Lasso estimators as special cases. On the optimization side, we show how to extend algorithms for M-estimators to fit trimmed variants and provide guarantees on their numerical convergence. The generality and competitive performance of high-dimensional trimmed estimators are illustrated numerically on both simulated and real-world genomics data.

Authors

• 46 publications
• 7 publications
• 14 publications
• Robust Gaussian Graphical Modeling with the Trimmed Graphical Lasso

Gaussian Graphical Models (GGMs) are popular tools for studying network ...
10/28/2015 ∙ by Eunho Yang, et al. ∙ 0

• Minimum Distance Estimation for Robust High-Dimensional Regression

We propose a minimum distance estimation method for robust regression in...
07/11/2013 ∙ by Aurélie C. Lozano, et al. ∙ 0

• Maximum Regularized Likelihood Estimators: A General Prediction Theory and Applications

Maximum regularized likelihood estimators (MRLEs) are arguably the most ...
10/09/2017 ∙ by Rui Zhuang, et al. ∙ 0

• Communication-efficient sparse regression: a one-shot approach

We devise a one-shot approach to distributed sparse regression in the hi...
03/14/2015 ∙ by Jason D. Lee, et al. ∙ 0

• A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees

Sparse high dimensional graphical model selection is a topic of much int...
07/20/2013 ∙ by Kshitij Khare, et al. ∙ 0

• Robust Estimation in High Dimensional Generalized Linear Models

Generalized Linear Models are routinely used in data analysis. The class...
09/29/2017 ∙ by Marina Valdora, et al. ∙ 0

• Fast and Scalable Learning of Sparse Changes in High-Dimensional Gaussian Graphical Model Structure

We focus on the problem of estimating the change in the dependency struc...
10/30/2017 ∙ by Beilun Wang, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

We consider the problem of high-dimensional estimation, where the number of variables may greatly exceed the number of observations Such high-dimensional settings are becoming increasingly prominent in a variety of domains, including signal processing, computational biology and finance. The development and the statistical analysis of structurally constrained estimators for high-dimensional estimation has recently attracted considerable attention. These estimators seek to minimize the sum of a loss function and a weighted regularizer. The most popular example is that of Lasso (Tibshirani 1996), which solves an -regularized (or equivalently -constrained) least squares problem. Under sub-Gaussian errors, Lasso has been shown to have strong statistical guarantees (van de Geer and Buhlmann 2009; Wainwright 2009). Regularized maximum likelihood estimators (MLEs) have been developed for sparsity-structured Generalized Linear Models (GLMs), with theoretical guarantees such as and -consistency (Negahban et al. 2012), and model selection consistency (Bunea 2008). For matrix-structured regression problems, estimators using nuclear-norm regularization have been studied e.g. by Recht et al. (2010). Another prime example is that of sparse inverse covariance estimation for graphical model selection (Ravikumar et al. 2011).

In practice, however, the desirable theoretical properties of such regularized M-estimators can be compromised, since outliers and corruptions are often present in high-dimensional data problems. These challenges motivate the development of robust structured learning methods that can cope with observations deviating from the model assumptions. The problem of reliable high-dimensional estimation under possibly gross error has gained increasing attention. Relevant prior work includes the “extended” Lasso formulation (Nguyen and Tran 2013)

which incorporates an additional sparse error vector to the original Lasso problem so as to account for corrupted observations, the LAD-Lasso

(Wang et al. 2007) which uses the least absolute deviation combined with an penalty, and the Robust Matching Pursuit method of Chen et al. (2013)

which performs feature selection based on a trimmed inner product of the features with the residuals, rather than a full inner product, so as to alleviate the impact of corrupted observations. In general, however, extending

-estimators beyond the least squares case is challenging. For example, Yang et al. (2013); Tibshirani and Manning (2014) extend the strategy in Nguyen and Tran (2013) to generalized linear models in two ways: the first requires modeling errors in the input space, which maintains convexity of the objective but imposes stringent conditions for consistency; the other modeling errors in the output space, breaks convexity and yield milder conditions.

A key motivation for trimmed approaches is that convex loss functions with linear tail growth (such as the -norm and Huber loss) are not robust enough. As Alfons et al. (2013) points out, both of these approaches have a breakdown point of , since even a single gross contamination can arbitrarily distort the regression coefficients. Remarkably, the median of least squares residual originally proposed by Rousseeuw (1984) avoids this problem, reaching breakdown point of nearly 50%; the approach is equivalent to ‘trimming’ a portion of the largest residuals. This lead to the consideration of sparse Least Trimmed Squares (sparse LTS) for robust high-dimensional estimation. While Alfons et al. (2013) established high breakdown point property for sparse LTS, its statistical convergence has not been previously analyzed.

In this paper, we present a unified framework and statistical analysis for trimmed regularized M-estimators, generalizing the sparse least trimmed squares (Sparse LTS) estimator (Alfons et al. 2013) to allow for a wide class of (possibly non-convex) loss functions as well as structured regularization. Using our analysis, we derive error bounds for the sparse LTS estimator. These require less stringent conditions for estimation consistency than those of Extended Lasso. We also derive error bounds for sparse Gaussian graphical models (GGMs) as a specific example. In contrast, existing approaches for robust sparse GGMs estimation lack statistical guarantees.

In terms of optimization-side, we use partial minimization to extend existing optimization algorithms for M-estimators to trimmed formulations. An important example of the approach is a modified proximal gradient method. For convex M-estimators, we show that under moderate assumptions, the ‘trimming’ is completed in finitely many steps, and thereafter the method reduces to a descent method for a convex problem over a fixed set of identified ‘inliers’. We use simulated data to compare with competing methods, and then apply our approach to real genomics datasets.

The manuscript is organized as follows. In Section 2 we introduce the general setup and present the family of High-Dimensional Trimmed estimators. The main theoretical results on their convergence and consistency are stated in Section 3, along with corollaries for linear models and Gaussian graphical models respectively. The partial minimization approach for optimization is described in Section 4. Empirical results are presented in Section 5 for simulated data and Section 6 for the analysis of genomics datasets . All proofs are collected in the Appendix.

2 A General Framework for High-Dimensional Trimmed Estimators

Motivating Example 1: Linear Regression.

To motivate high-dimensional trimmed estimators, we start with high-dimensional linear regression. The real-valued observation

comes from the linear model

 yi=⟨xi,θ∗⟩+oi,i=1,…,n, (1)

where is a covariate, the true regression parameter vector is , and is the observation noise. Since outliers are commonly present in high-dimensional data problems, we assume is substantially larger than without loss of generality.

Let be the set of “good” samples, and denote the set of “bad” samples arbitrary corrupted. We are particularly concerned with the scenario where all samples in are potentially badly (arbitrarily) corrupted.

In order to cope with observations that deviate from the true model, Alfons et al. (2013) proposed sparse LTS, an -penalized version of the classical least trimmed squares (LTS) estimator (Rousseeuw 1984) solving

 minimizeθ12hh∑i=1[r2(θ)](i)+λ∥θ∥1, (2)

where with , and are the order statistics of the squared residuals . Alfons et al. (2013) established the breakdown point of the resulting sparse LTS estimator, and proposed an iterative algorithm for its computation. At iteration , the algorithm computes the Lasso solution based on the current subset of observations with and constructs the next subset from the observations corresponding to the smallest squared residuals.

Our starting point is the following reformulation of regularized LTS problem:

 minimizew∈Δh,θ∈ρB1 12hn∑i=1wi(yi−⟨xi,θ⟩)2+λ∥θ∥1 (3)

where is the -scaled capped unit simplex, is the -norm ball, and the constraint (or equivalently ) ensures that the optimum of non-convex problem (3) exists as discussed in Loh and Wainwright (2015). This constraint on is a theoretical safeguard, since problem (3) is equivalent to the problem (2) when is large enough.

A family of trimmed estimators.

Based on the reformulation (3), we propose the family of trimmed estimators for general high-dimensional problems: given a collection of arbitrary corrupted samples , and a differentiable (possibly non-convex) loss function , we solve

 minimizew∈Δh,θ∈ρBRf(w,θ):= 1hn∑i=1wi¯L(θ;Zi)+λR(θ) (4)

where is a decomposable norm used as a regularizer (Negahban et al. 2012) to encourage particular low-dimensional structure of the estimator, and is the unit ball for (in other words, the constraint ensures ). decides the number of samples (or sum of weights) used in the training. is ideally set as the number of uncorrupted samples in , but practically we can tune the parameter by cross-validation.

Motivating Example 2: Graphical Models.

Gaussian graphical models (GGMs) form a powerful class of statistical models for representing distributions over a set of variables (Lauritzen 1996). These models employ undirected graphs to encode conditional independence assumptions among the variables, which is particularly convenient for exploring network structures. GGMs are widely used in variety of domains, including computational biology (Oh and Deasy 2014)

(Manning and Schutze 1999), image processing (Woods 1978; Hassner and Sklansky 1978; Cross and Jain 1983), statistical physics (Ising 1925), and spatial statistics (Ripley 1981).

In such high-dimensional settings, sparsity constraints are particularly pertinent for estimating GGMs, as they encourage only a few parameters to be non-zero and induce graphs with few edges. The most widely used estimator, the Graphical Lasso minimizes the negative Gaussian log-likelihood regularized by the norm of the entries (or the off-diagonal entries) of the precision matrix (see Yuan and Lin (2007); Friedman et al. (2007); Bannerjee et al. (2008)). This estimator enjoys strong statistical guarantees (see e.g. Ravikumar et al. (2011)). The corresponding optimization problem is a log-determinant program that can be solved with interior point methods (Boyd and Vandenberghe 2004) or by co-ordinate descent algorithms (Friedman et al. 2007; Bannerjee et al. 2008). Alternatively neighborhood selection (Meinshausen and Bühlmann 2006; Yang et al. 2012) can be employed to estimate conditional independence relationships separately for each node in the graph, via Lasso linear regression (Tibshirani 1996). Under certain assumptions, the sparse GGM structure can still be recovered even under high-dimensional settings.

The aforementioned approaches rest on a fundamental assumption: the multivariate normality of the observations. However, outliers and corruption are frequently encountered in high-dimensional data (see e.g. Daye et al. (2012) for gene expression data). Contamination of a few observations can drastically affect the quality of model estimation. It is therefore imperative to devise procedures that can cope with observations deviating from the model assumption. Despite this fact, little attention has been paid to robust estimation of high-dimensional graphical models. Partially Relevant work includes Finegold and Drton (2011), which leverages multivariate -distributions for robustified inference and the EM algorithm. They also propose an alternative -model which adds flexibility to the classical but requires the use of Monte Carlo EM or variational approximation as the likelihood function is not available explicitly. Another pertinent work is that of Sun and Li (2012) which introduces a robustified likelihood function. A two-stage procedure is proposed for model estimation, where the graphical structure is first obtained via coordinate gradient descent and the concentration matrix coefficients are subsequently re-estimated using iterative proportional fitting so as to guarantee positive definiteness of the final estimate.

A special case of the proposed family is that of the Trimmed Graphical Lasso for robust estimation of sparse GGMs:

 minimizeΘ∈Ω∩RB1,w∈Δh ⟨⟨Θ,1hn∑i=1wiX(i)(X(i))⊤⟩⟩−logdet(Θ)+λ∥Θ∥1,off. (5)

Here for matrices and , denotes the trace inner product . For a matrix and parameter , denotes the element-wise norm, and does the element-wise norm only for off-diagonal entries. For example, .

We provide statistical guarantees on the consistency of this estimator. To the best of our knowledge, this is in stark contrast with prior work on robust sparse GGM estimation (e.g.  Finegold and Drton (2011); Sun and Li (2012)) which are not statistically guaranteed in theory.

3 Statistical Guarantees of Trimmed Estimators

In this section, we provide a statistical analysis of the family of structurally regularized estimators (4). In order to simplify the notation in our theorem and its corollaries, we assume without loss of generality that the number of good samples is known a priori and the tuning parameter in (4) is exactly set as the genuine samples size, . This is an unrealistic assumption, however, as long as we set smaller than , the statements in the main theorem and its corollaries can be applied as they are.

Noting that the optimization problem (4) is non-convex, estimators returned by iterative methods for (4) will be stationary points. We call a local minimum of (4) when

1. is a local minimum of and

2. is a global minimum of .

These are precisely the points that are found by the algorithms developed in Section 4. In this section, we give statistical error bounds for any such points.

Consider any such local minimum . While we are mainly interested in the error bounds of our estimator for target parameter (that is, ), we first define as follows: for the index , is simply set to so that . Otherwise for the index , we set . Note that while is fixed unconditionally, is dependent on . However, is fixed given .

In order to guarantee bounded errors, we first assume that given , the following restricted strong convexity condition for holds:

1. [leftmargin=0.2cm, itemindent=1.2cm,label=(C-), ref=(C-),start=1]

2. (Restricted strong convexity (RSC) on ) We overload notation and use to denote . Then, for any possible , the differentiable loss function satisfies

 ⟨∇θL(θ∗+Δ,w∗)−∇θL(θ∗,w∗),Δ⟩≥κl∥Δ∥22−τ1(n,p)R(Δ)2,

where is a curvature parameter, and is a tolerance function on and .

Note that this condition is slightly different from the standard restricted strong convexity condition because of the dependency on and therefore on . Each local optimum has its own restricted strong convexity condition. In case of no corruption with for all , this condition will be trivially reduced to the standard RSC condition, under which the standard general -estimator has been analyzed (see Negahban et al. (2012) for details).

We additionally require the following condition for a successful estimation with (4) on corrupted samples:

1. [leftmargin=0.2cm, itemindent=1.2cm,label=(C-), ref=(C-), start=2]

2. Consider arbitrary local optimum . Letting and ,

1 can be understood as a structural incoherence condition between and . This type of condition is also needed for the guarantees of extended LASSO (Nguyen and Tran 2013) and other dirty statistical models with more than a single parameter (Yang and Ravikumar 2013). Note again that due to the dependency on , each local optimum will have its own conditions 1 and 1. We will see later in this section that these two conditions are mild enough for the popular estimators (such as linear models and GGMs) to satisfy.

Armed with these conditions, we state the main theorem on the error bounds of (4):

Theorem 1.

Consider an -estimator from (4) with any local minimum , and suppose that it satisfies the conditions 1 and 1. Suppose also that the regularization parameter in (4) is set as

 λ≥4max{ R∗(∇θL(θ∗,w∗)),2ρτ1(n,p)+τ3(n,p)} (6)

where is the dual norm of : . Then the following error bounds for are guaranteed for a given model space :

 ∥˜θ−θ∗∥2≤1κl(3λΨ2+τ2(n,p))andR(˜θ−θ∗)≤2λκl(2λΨ+τ2(n,p))2,

where

 Ψ:=supu∈M∖{0}R(u)/∥u∥2

measures the compatibility between and norms.

For sparse vectors, where is the sparsity of true parameter , and is the space of vectors with the correct support set (Negahban et al. 2012).

The statement in Theorem 1 is applicable to any local minimum of (4), and it holds deterministically. Probabilistic statements come in when the condition on specified in Theorem 1 is satisfied. In (6), is chosen based on similarly to Negahban et al. (2012). We shall see that the remaining terms with tolerance functions in (6) have the same order as for the specific cases of linear models and GGMs developed in the next sections.

3.1 Statistical Guarantees of High-Dimensional Least Trimmed Squares

We now focus on the special case of high-dimensional linear regression, and apply Theorem 1 to problem (3). In particular, if , where the observation noise follows zero mean and has sub-Gaussian tails. Otherwise, for , where is the amount of arbitrary corruption.

In order to derive an actual bound from the general framework of Theorem 1, we consider the following natural setting, which has been widely studied in past work on conventional high dimensional linear models:

1. [leftmargin=0.5cm, itemindent=2cm,label=(LTS1), ref=(LTS1)]

2. (-Gaussian ensemble) Each sample is i.i.d. sampled from .

1. [leftmargin=0.5cm, itemindent=2cm,label=(LTS2), ref=(LTS2)]

2. (Sub-Gaussian noise) The noise vector is zero-mean and has sub-Gaussian tails, which means that for any fixed vector such that , for all

. The sub-Gaussian is quite a wide class of distributions, and contains the Gaussian family as well as and all bounded random variables.

1. [leftmargin=0.5cm, itemindent=2cm,label=(LTS3), ref=(LTS3)]

2. (Column normalization) Let be the design matrix whose -th row is the covariate -th sample: , and be the -th column vector of . Then, . As pointed out in Negahban et al. (2012), we can always rescale linear models with out loss of generality to satisfy this condition.

The following assumptions are required for our estimator to be resilient to outliers and strongly consistent:

1. [leftmargin=0.5cm, itemindent=1.8cm,label=(C-), ref=(C-)]

2. Let be the number of good samples: and hence . Then, we assume that larger portion of samples are genuine and uncorrupted so that where . If we assume that 40% of samples are corrupted, then .

1. [leftmargin=0.5cm, itemindent=2cm,label=(LTS4), ref=(LTS4)]

2. We set the tuning parameter in (3) as for some constant . This setting requires that the number of good samples is larger than or equal to so that the true regression parameter is feasible for the objective.

Under these conditions, we can recover the following error bounds of high-dimensional LTS (3), as a corollary of Theorem 1:

Corollary 1.

Consider corrupted linear models (1) when . Suppose that conditions 1, 1, 1, 1, and 1 hold. Also suppose that we find a local minimum of (3), choosing

 λ=c√logph

where is some constant dependent on , and the upper bound of .111Here without loss of generality, we can assume that is bounded by some constant since we can always rescale the linear models properly without changing the signal . Then, is guaranteed to satisfy 1 and 1 for the specific case of (3), and have the following error bounds: for some constant depending on , and the portion of genuine samples in 1, and some constant smaller than ,

 ∥˜θ−θ∗∥2≤ c′(√klogph+c′′√|B|logph), ∥˜θ−θ∗∥1≤ 4c′(√klogph+c′′√|B|logph)2

with probability at least

for some universal positive constants and .

Note that Corollary 1 concerns any single local minimum. For the guarantees of multiple local optima simultaneously, we may use a union bound from the corollary.

Remarks.

It is instructive to compare the error rates and conditions in Corollary 1 with statistical guarantees of extended Lasso analyzed in Nguyen and Tran (2013). The extended Lasso estimator solves:

 minimizeθ,e 12nn∑i=1(yi−⟨xi,θ⟩−ei)2+λθ∥θ∥1+λe∥e∥1

where is the regularization parameter for parameter capturing corruptions. is encouraged to be sparse to reflect the fact that only a fraction of samples is corrupted. The norm-based error rate in Corollary 1 is almost the same as that of extended Lasso: under the standard Gaussian design setting 1. As long as at least a linear fraction of samples is not contaminated (that is, for ), the error rates for both estimators will be asymptotically the same.

However, it is important to revisit the conditions required for the statistical guarantees of extended Lasso. Besides an extended version of the restricted eigenvalue condition,

Nguyen and Tran (2013) assumes a mutual incoherence condition, which in turn requires for some large and fixed constant . Provided that and are fixed, the inequality can hold for a large enough sample size . However, when grows with , this condition will be violated; for example if (i) a square root fraction of samples is corrupted () for a fixed or (ii) a linear fraction of is corrupted (), then can easily exceed . Our experimental results of Section 5 will confirm this observation: as the fraction of corruptions increases, the performance of extended Lasso deteriorates compared to that of our estimator (3).

Statistical Guarantees When Covariates Are Corrupted.

In the linear model (1

), corruption is considered in the space of the response variable

: namely an additional random variable is used to model corruption in the response space. Even in the case where we have outliers with corrupted covariates , can be understood as the mean-shift variable to model . For linear models, modeling outliers in the parameter space or modeling them in the output space is thus equivalent (In constrast, for more general GLM settings, the link function is not the identity function and both approaches are distinct, see e.g. (Yang et al. 2013)). Nevertheless, when outliers stem from corrupted covariates, condition 1 might be violated. For this setting, we introduce the following alternative condition:

1. [leftmargin=0.5cm, itemindent=2cm,label=(LTS5), ref=(LTS5)]

2. (-Gaussian ensemble) Each sample in is i.i.d. sampled from . Let be the sub-design matrix in corresponding to outliers. Then, we define such that .

Under condition 1 we recover results similar to Corollary 1:

Corollary 2.

Consider linear models in (1) where . Suppose that all the conditions 1, 1, 1, 1 and 1 hold. Also suppose that we choose the regularization parameter

 λ=c√logph

where is some constant dependent on , and and the upper bound of . Then, is guaranteed to have the following error bounds as before: for some constant depending on , and the portion of genuine samples in 1, and some constant smaller than ,

 ∥˜θ−θ∗∥2≤ c′(√klogph+c′′√|B|logph), ∥˜θ−θ∗∥1≤ 4c′(√klogph+c′′√|B|logph)2

with probability at least for some universal positive constants and .

3.2 Statistical Guarantees of Trimmed Graphical Lasso

We now focus on Gaussian graphical models and provide the statistical guarantees of our Trimmed Graphical Lasso estimator as presented in Section 2 (Motivating Example 2). Our theory in this section provides the statistical error bounds on any local minimum of (5). We use and to denote the Frobenius and spectral norms, respectively.

Let be a zero-mean Gaussian random field parameterized by concentration matrix :

 P(X;Θ∗)=exp(−12⟨⟨Θ∗,XX⊤⟩⟩−A(Θ∗)) (7)

where

is the log-partition function of Gaussian random field. Here, the probability density function in (

7) is associated with

-variate Gaussian distribution,

where .

We consider the case where the number of random variables may be substantially larger than the number of sample size , however, the concentration parameter of the underlying distribution is sparse so that the number of non-zero off-diagonal entries of is at most : .

We now investigate how easily we can satisfy the conditions in Theorem 1. Intuitively it is impossible to recover true parameter by weighting approach as in (5) when the amount of corruptions exceeds that of normal observation errors.

To this end, suppose that we have some upper bound on the corruptions:

1. [leftmargin=0.5cm, itemindent=2cm,label=(TGL1), ref=(TGL1)]

2. For some function , we have

where denotes the sub-design matrix in corresponding to outliers. Under this assumption, we can recover the following error bounds of Trimmed Graphical Lasso (5), as a new corollary of Theorem 1:

Corollary 3.

Consider corrupted Gaussian graphical models with conditions 1 and 1. Suppose that we compute the local optimum of (5) choosing

 λ=4max{8(maxiΣ∗ii)√30logph−|B|+|B|h∥Σ∗∥∞,f(XB)√logph}≤c1−f(XB)√|B|logph3R.

Then, is guaranteed to satisfy 1 and 1 for the specific case of (5) and have the error bounds of

 ∥˜Θ−Θ∗∥F≤1κl⎛⎝3λ√k+p2+f(XB)√|B|logph⎞⎠and ∥˜Θ−Θ∗∥1,off≤2λκl⎛⎝3λ√k+p+f(XB)√2|B|logpn⎞⎠2 (8)

with probability at least for some universal positive constants and .

In Corollary 3, the term captures the relation between element-wise norm and the error norm including diagonal entries.

If we further assume that the number of corrupted samples scales with at most :

1. [leftmargin=0.5cm, itemindent=2cm,label=(TGL2), ref=(TGL2)]

2. for some constant ,

then we can derive the following result as another corollary of Theorem 1:

Corollary 4.

Consider corrupted Gaussian graphical models, and compute the local minimum of (5), setting

 λ=c√logpn,c:=4max{16(maxiΣ∗ii)√15+2a∥Σ∗∥∞√logp,√2f(XB)}.

Suppose that the conditions 1, 1 and 1 hold. Then, if the sample size is lower bounded as

 n≥max{16a2,(|||Θ∗|||2+1)4(3Rc+f(XB)√2|B|)2(logp)},

then is guaranteed to satisfy 1 and 1 for the specific case of (5) and have the following error bound:

 ∥˜Θ−Θ∗∥F≤1κl⎛⎝3c2√(k+p)logpn+f(XB)√2|B|logpn⎞⎠ (9)

with probability at least for some universal positive constants and .

Note that an -norm error bound can also be easily derived using the selection of from (3).

Remarks.

Corollary 4 reveals an interesting result: even when samples out of total samples are corrupted, our estimator (5) can successfully recover the true parameter with guaranteed error in (9). The first term in this bound is which exactly matches the Frobenius error bound for the case without outliers (see Ravikumar et al. (2011); Loh and Wainwright (2013) for example). Due to the outliers, the performance degrades with the second term, which is . To the best of our knowledge, our results are the first statistical error bounds available in the litterature on parameter estimation for Gaussian graphical models with outliers.

When Outliers Follow a Gaussian Graphical Model.

Now let us provide a concrete example and show how in 1 is precisely specified in this case:

1. [leftmargin=0.5cm, itemindent=2cm,label=(TGL3), ref=(TGL3)]

2. Outliers in the set are drawn from another Gaussian graphical model (7) with a parameter .

This can be understood as a Gaussian mixture model where most of the samples are drawn from

which we want to estimate, and a small portion of samples are drawn from . In this case, Corollary 4 can be further shaped as follows:

Corollary 5.

Suppose that the conditions 1, 1 and 1 hold. Then the statement in Corollary 4 holds with .

4 Optimization for Trimmed Estimators

While the objective function in (4) is non-convex in , it simplifies for block or held fixed. Perhaps for this reason, prior algorithms for trimmed approaches (Rousseeuw 1984; Alfons et al. 2013) alternated between solving for and . Unfortunately, each solve in is as expensive as finding the original (untrimmed) estimator.

Here, we take advantage of the fact that the computational complexity of the two subproblems in and are completely different. With fixed, the problem in is equivalent to classic high-dimensional problems, e.g. Lasso, which is typically solved by first order methods. In contrast, the problem in for fixed

is the simple linear program

 minimizew∈Δh n∑i=1wi¯L(θ;Zi) (10)

with all dependence on the predictors captured by the current losses . The solution is obtained setting for the smallest values of , and setting remaining to .

We exploit structure, using partial minimization. Similar ideas have been used for optimizing a range of nonlinear least squares problems (Golub and Pereyra 2003) as well as more general problems involving nuisance parameters (Aravkin and Van Leeuwen 2012). Rather than an alternating scheme (similar to that of Alfons et al. (2013) for least squares) where we solve multiple ‘weighted’ regularized problems to completion, we can rewrite the problem as follows:

 (11)

Problem (11) is equivalent to (4). The reader can verify that is not smooth222When , trimming equates to minimizing the minimum of , a problem which is nonsmooth and nonconvex.. However, partial minimization provides a way to modify any descent method for fitting an M-estimator to bear on the corresponding trimmed estimator (11). Algorithm 1 gives a description of the steps involved for the specific case of extending proximal gradient descent. The algorithm uses the proximal mapping, which for the case of regularization is the soft-thresholding operator defined as . We assume that we pick sufficiently large, so one does not need to enforce the constraint explicitly.

When the loss  is convex and smooth with Lipschitz continuous gradient, the proximal gradient has a global convergence theory (see e.g. Nesterov (2004)). Convergence of the extended Algorithm 1 is analyzed in the following proposition.

Proposition 1.

Consider any monotonic algorithm for solving , i.e. (i) guarantees that and (ii) for any fixed , produces converging sequence of when solving . If is extended to solve (11) using partial minimization (10), the monotonic property is preserved, at least one limit point exists, and every limit point of the sequence is a stationary point of (4). Moreover, if is convex, and estimators over each feasible data selection have different optimal values, then converge in finitely many steps, and the extended algorithm converges to a local minimum333 is a local minimum of and is a global minimum of . of (4).

Finite convergence of the weights is an important point for practical implementation, since once the weights converge, one is essentially solving a single estimation problem, rather than a sequence of such problems. In particular, after finitely many steps, the extended algorithm inherits all properties of the original algorithm for the M-estimator over the selected data.

5 Simulated Data Experiments

We illustrate the generality of our approach by considering sparse logistic regression, trace-norm regularized multi-reponse regression and sparse GGMs (For experiments with sparse linear models, see

Alfons et al. (2013)).

5.1 Simulations for Sparse Logistic Regression

We begin with sparse logistic regression. We adopt an experimental protocol similar to Yang et al. (2013). We consider features. The parameter vectors have non-zero entries sampled i.i.d. from The data matrix is such that each of its

observations is sampled from a standard Normal distribution

Given each observation, we draw a true class label from following the logistic regression model. We show two scenarios, selecting either or samples with the highest amplitude of and flipping their labels. We compare the errors over 100 simulation runs of the new estimator with those of vanilla Lasso for logistic regression, and with two extended Lasso methods for logistic regression of Yang et al. (2013) (with “error in parameter” and in “error in output”) as the sample size increases. Figure 1 shows that the trimmed approach has both better performance (achieves lower errors), and is faster, matching the computational efficiency of the vanilla Lasso method. This result is anticipated by Proposition 1: the weights converge in finitely many steps, and then we are essentially solving the Lasso with a fixed weight set thereafter.

5.2 Simulations for Trace-Norm Regularized Regression

Beyond the penalty, we consider trace-norm regularized multi response regression. We set , for . We consider samples, covariates, and responses. Each entry of is generated independently from To generate the true low rank weights, we first sample a matrix of coefficients, with each coefficient sampled independently from . We then set the true parameter matrix to the best rank 3 approximation of the sample, obtained using an SVD. For clean samples in , we then set the error term as The contaminated terms are generated with an error term as We consider varying corruption levels ranging from to The parameters are tuned as in the previous section and we present the average error based on 100 simulation runs. Figure 2 further illustrates the computational advantage of the partial minimization scheme described in Section 4 for general structures.

5.3 Simulations for Gaussian Graphical Models

We compare the Trimmed Graphical Lasso (trim-glasso) algorithm against the vanilla Graphical Lasso(glasso) Friedman et al. (2007); the t-lasso and t*-lasso methods Finegold and Drton (2011), and robust-LL: the robustified-likelihood approach of Sun and Li (2012).

Our simulation setup is similar to Sun and Li (2012) and is a akin to gene regulatory networks. Namely we consider four different scenarios where the outliers are generated from models with different graphical structures. Specifically, each sample is generated from the following mixture distribution:

 yk∼(1−p0)Np(0,θ−1)+p02Np(−μ,θ−1o)+p02Np(μ,θ−1o),  k=1,…,n,

where and . Four different outlier distributions are considered:

• M1: ,  M2: ,

• M3: ,  M4: .

For each simulation run, is a randomly generated precision matrix corresponding to a network with hub nodes simulated as follows. Let be the adjacency of the network. For all we set with probability 0.03, and zero otherwise. We set We then randomly select 9 hub nodes and set the elements of the corresponding rows and columns of to one with probability 0.4 and zero otherwise. Using , the simulated nonzero coefficients of the precision matrix are sampled as follows. First we create a matrix so that if , and is sampled uniformly from if Then we set Finally we set where is the smallest eigenvalue of is a randomly generated precision matrix in the same way is generated.

For the robustness parameter of the robust-LL method, we consider as recommended in Sun and Li (2012). For the trim-glasso method we consider Since all the robust comparison methods converge to a stationary point, we tested various initialization strategies for the concentration matrix, including , and the estimate from glasso. We did not observe any noticeable impact on the results.

Figure 3 presents the average ROC curves of the comparison methods over 100 simulation data sets for scenarios M1-M4 as the tuning parameter varies. In the figure, for robust-LL and trim-glasso methods, we depict the best curves with respect to parameter and respectively. The detailed results for all the values of and considered are provided in the appendix.

From the ROC curves we can see that our proposed approach is competitive compared the alternative robust approaches t-lasso, t*-lasso and robust-LL. The edge over glasso is even more pronounced for scenarios M2, M4. Surprisingly, trim-glasso with achieves superior sensitivity for nearly any specificity.

Computationally the trim-glasso method is also competitive compared to alternatives. The average run-time over the path of tuning parameters is 45.78s for t-lasso, 22.14s for t*-lasso, 11.06s for robust-LL, 1.58s for trimmed lasso, 1.04s for glasso. Experiments were run on R in a single computing node with a Intel Core i5 2.5GHz CPU and 8G memory. For t-lasso, t*-lasso and robust-LL we used the R implementations provided by the methods’ authors. For glasso we used the glassopath package.

6 Application Genomic Analysis

6.1 Analysis of Yeast Genotype and Expression data

We apply Sparse-LTS , Extended Lasso (Nguyen and Tran 2013), LAD Lasso (Wang et al. 2007), standard Least Squares Lasso estimator (Tibshirani 1996), and ROMP (Chen et al. 2013) to the analysis of yeast genotype and gene expression data. We employ the “yeast” dataset from Brem et al. (2005). The data set concerns F1 segregants from a yeast genetic cross between two strains: BY and RM. For each of these 112 samples, we observe SNPs (These genotype data are our predictors ) and focus on the gene expression of gene GPA1 (our response ), which is involved in pheromone response Brem et al. (2005). For both Sparse-LTS-Ada and Sparse LTS considering a total of contaminated observations lead to the best predictive performance on the uncontaminated data. In addition, the QQ-plots of the fitted residuals from the various comparison methods indicated heavy left tails (see Figure 4). This suggests that it might be advisable to use robust methods.

We compare the trimmed mean square error (T-MSE) computed from 10-folds cross validation for each method, where for each method we exclude the 11 observations with largest residual absolute error. From Table 3 we can see thatSparse-LTS exhibit the smallest T-MSE.

We conclude by examining the SNPs selected by the methods achieving the lowest T-MSE: Sparse-LTS and LAD Lasso. Out of SNPs, Sparse-LTS selected 30 SNPs, and LAD Lasso chose 61 SNPs. Table 3 provides a list of the SNPs selected on chromosome 8, which is where gene GPA1 resides. In the dataset, there is a total of 166 SNPs on chromosome 8. From the table we can see that there is some overlap in terms of the selected SNPs across the various methods. Sparse-LTS tends to select a larger number of SNPs on chromosome 8 even though it selects fewer SNPs in total (namely within and beyond chromosome 8). Five of these are very close to GPA1 which is consistent with the fact that GPA1 can directly inhibit the mating signal by binding to its own subunit Stratton et al. (1996).

6.2 Application to the analysis of Yeast Gene Expression Data

We analyze a yeast microarray dataset generated by Brem and Kruglyak (2005). The dataset concerns yeast segregants (instances). We focused on genes (variables) belonging to cell-cycle pathway as provided by the KEGG database Kanehisa et al. (2014)

. For each of these genes we standardize the gene expression data to zero-mean and unit standard deviation. We observed that the expression levels of some genes are clearly not symmetric about their means and might include outliers. For example the histogram of gene

ORC3 is presented in Figure 5(a).

For the robust-LL method we set and for trim-glasso we use We use 5-fold-CV to choose the tuning parameters for each method. After is chosen for each method, we rerun the methods using the full dataset to obtain the final precision matrix estimates.

Figure 5(b) shows the cell-cycle pathway estimated by our proposed method. For comparison the cell-cycle pathway from the KEGG Kanehisa et al. (2014) is provided in Figure 6.

It is important to note that the KEGG graph corresponds to what is currently known about the pathway. It should not be treated as the ground truth. Certain discrepancies between KEGG and estimated graphs may also be caused by inherent limitations in the dataset used for modeling. For instance, some edges in cell-cycle pathway may not be observable from gene expression data. Additionally, the perturbation of cellular systems might not be strong enough to enable accurate inference of some of the links.

glasso tends to estimate more links than the robust methods. We postulate that the lack of robustness might result in inaccurate network reconstruction and the identification of spurious links. Robust methods tend to estimate networks that are more consistent with that from the KEGG (-score of 0.23 for glasso, 0.37 for t*-lasso, 0.39 for robust-NLL and 0.41 for trim-glasso, where the

score is the harmonic mean between precision and recall). For instance our approach recovers several characteristics of the KEGG pathway. For instance, genes

CDC6 (a key regulator of DNA replication playing important roles in the activation and maintenance of the checkpoint mechanisms coordinating S phase and mitosis) and PDS1 (essential gene for meiotic progression and mitotic cell cycle arrest) are identified as a hub genes, while genes CLB3,BRN1,YCG1 are unconnected to any other genes.

7 Concluding Remarks

We presented a family of trimmed estimators for a wide class of structured high-dimensional problems. We provided general results on their statistical convergence rates and consistency. In particular our results for sparse linear regression and gaussian graphical models allow to precisely characterize the impact of corruptions on the statistical performance of the resulting estimatiors, while recovering the rates of their ‘untrimmed’ counterparts under clean data. We showed how to efficiently adapt existing optimization algorithms to solve the modified trimmed problems. Relevant directions for future work include specializing our theoretical analysis to generalized linear models, applying and analyzing trimmed approaches for more general structural regularizations, and the study of concomittent selection of the amount of trimming.

References

• Alfons et al. (2013) Alfons, A., Croux, C., and Gelper, S. (2013), “Sparse least trimmed squares regression for analyzing high-dimensional large data sets,” Ann. Appl. Stat., 7, 226–248.
• Aravkin and Van Leeuwen (2012) Aravkin, A. Y. and Van Leeuwen, T. (2012), “Estimating nuisance parameters in inverse problems,” Inverse Problems, 28, 115016.
• Bannerjee et al. (2008) Bannerjee, O., , Ghaoui, L. E., and d’Aspremont, A. (2008), “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Jour. Mach. Lear. Res., 9, 485–516.
• Beck and Teboulle (2009) Beck, A. and Teboulle, M. (2009), “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, 2, 183–202.
• Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004), Convex optimization, Cambridge, UK: Cambridge University Press.
• Brem and Kruglyak (2005) Brem, R. B. and Kruglyak, L. (2005), “The landscape of genetic complexity across 5,700 gene expression traits in yeast,” Proceedings of the National Academy of Sciences of the United States of America, 102, 1572–1577.
• Brem et al. (2005) Brem, R. B., Storey, J. D., Whittle, J., and Kruglyak, L. (2005), “Genetic interactions between polymorphisms that affect gene expression in yeast.” Nature, 436, 701–703.
• Bunea (2008) Bunea, F. (2008), “Honest variable selection in linear and logistic regression models via l1 and l1 + l2 penalization,” Electron. J. Stat., 2, 1153–1194.
• Candès et al. (2006) Candès, E., Romberg, J., and Tao, T. (2006), “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, 59, 1207–1223.
• Chen et al. (2013) Chen, Y., Caramanis, C., and Mannor, S. (2013), “Robust High Dimensional Sparse Regression and Matching Pursuit,”

The Proceedings of the International Conference on Machine Learning (ICML)

.
• Cross and Jain (1983) Cross, G. and Jain, A. (1983), “Markov Random Field Texture Models,” IEEE Trans. PAMI, 5, 25–39.
• Daye et al. (2012)

Daye, Z., Chen, J., and H., L. (2012), “High-Dimensional Heteroscedastic Regression with an Application to eQTL Data Analysis,”

Biometrics, 68, 316–326.
• Finegold and Drton (2011) Finegold, M. and Drton, M. (2011), “Robust graphical modeling of gene networks using classical and alternative T-distributions,” The Annals of Applied Statistics, 5, 1057–1080.
• Friedman et al. (2007) Friedman, J., Hastie, T., and Tibshirani, R. (2007), “Sparse inverse covariance estimation with the graphical Lasso,” Biostatistics.
• Golub and Pereyra (2003) Golub, G. and Pereyra, V. (2003), “Separable nonlinear least squares: the variable projection method and its applications,” Inverse Problems, 19, R1–R26.
• Hassner and Sklansky (1978) Hassner, M. and Sklansky, J. (1978), “Markov Random Field Models of Digitized Image Texture,” in ICPR78, pp. 538–540.
• Ising (1925) Ising, E. (1925), “Beitrag zur Theorie der Ferromagnetismus,” Zeitschrift für Physik, 31, 253–258.
• Kanehisa et al. (2014) Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2014), “Data, information, knowledge and principle: back to metabolism in KEGG,” Nucleic Acids Res., 42, D199–D205.
• Lauritzen (1996) Lauritzen, S. (1996), Graphical models, Oxford University Press, USA.
• Loh and Wainwright (2015) Loh, P. and Wainwright, M. J. (2015), “Regularized M-estimators with Nonconvexity: Statistical and Algorithmic Theory for Local Optima,” Journal of Machine Learning Research (JMLR), 16, 559–616.
• Loh and Wainwright (2013) Loh, P.-L. and Wainwright, M. J. (2013), “Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima,” in Neur. Info. Proc. Sys. (NIPS), 26.
• Manning and Schutze (1999) Manning, C. D. and Schutze, H. (1999), Foundations of Statistical Natural Language Processing, MIT Press.
• Meinshausen and Bühlmann (2006) Meinshausen, N. and Bühlmann, P. (2006), “High-dimensional graphs and variable selection with the Lasso,” Annals of Statistics, 34, 1436–1462.
• Negahban et al. (2012) Negahban, S., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012), “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,” Statistical Science, 27, 538–557.
• Nesterov (2004) Nesterov, Y. (2004), Introductory lectures on convex optimization, vol. 87 of Applied Optimization, Kluwer Academic Publishers, Boston, MA, a basic course.
• Nguyen and Tran (2013) Nguyen, N. H. and Tran, T. D. (2013), “Robust Lasso with missing and grossly corrupted observations,” IEEE Trans. Info. Theory, 59, 2036–2058.
• Oh and Deasy (2014) Oh, J. H. and Deasy, J. O. (2014), “Inference of radio-responsive gene regulatory networks using the graphical lasso algorithm,” BMC Bioinformatics, 15, S5.
• Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. (2010), “Restricted Eigenvalue Properties for Correlated Gaussian Designs,” Journal of Machine Learning Research (JMLR), 99, 2241–2259.
• Ravikumar et al. (2011) Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011), “High-dimensional covariance estimation by minimizing -penalized log-determinant divergence,” Electronic Journal of Statistics, 5, 935–980.
• Recht et al. (2010) Recht, B., Fazel, M., and Parrilo, P. A. (2010), “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM review, 52, 471–501.
• Ripley (1981) Ripley, B. D. (1981), Spatial statistics, New York: Wiley.
• Rockafellar and Wets (2009) Rockafellar, R. T. and Wets, R. J.-B. (2009), Variational analysis, vol. 317, Springer Science & Business Media.
• Rousseeuw (1984) Rousseeuw, P. J. (1984), “Least median of squares regression,” J. Amer. Statist. Assoc., 79, 871–880.
• Stratton et al. (1996) Stratton, H., Zhou, J., Reed, S., and Stone, D. (1996), “The Mating-Specific Galpha Protein of Saccharomyces cerevisiae Downregulates the Mating Signal by a Mechanism That Is Dependent on Pheromone and Independent of Gbetagamma Sequestration,” Molecular and Cellular Biology.
• Sun and Li (2012) Sun, H. and Li, H. (2012), “Robust Gaussian graphical modeling via l1 penalization,” Biometrics, 68, 1197–206.
• Tibshirani and Manning (2014) Tibshirani, J. and Manning, C. D. (2014), “Robust Logistic Regression using Shift Parameters.” in ACL (2), pp. 124–129.
• Tibshirani (1996) Tibshirani, R. (1996), “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
• Tu et al. (2016) Tu, N., Aravkin, A., van Leeuwen, T., Lin, T., and Herrmann, F. J. (2016), “Source estimation with surface-related multiples—fast ambiguity-resolved seismic imaging,” Geophysical Journal International, 205, 1492–1511.
• van de Geer and Buhlmann (2009) van de Geer, S. and Buhlmann, P. (2009), “On the conditions used to prove oracle results for the Lasso,” Electronic Journal of Statistics, 3, 1360–1392.
• Vershynin (2012)

Vershynin, R. (2012), “Introduction to the non-asymptotic analysis of random matrices,” in

Compressed Sensing: Theory and Applications, eds. Eldar, Y. and Kutyniok, G., Cambridge University Press, pp. 210–268, forthcoming.
• Wainwright (2009) Wainwright, M. J. (2009), “Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (Lasso),” IEEE Trans. Information Theory, 55, 2183–2202.
• Wang et al. (2007) Wang, H., Li, G., and Jiang, G. (2007), “Robust regression shrinkage and consistent variable selection through the LAD-lasso,” Journal of Business and Economics Statistics, 25, 347–355.
• Woods (1978) Woods, J. (1978), “Markov Image Modeling,” IEEE Transactions on Automatic Control, 23, 846–850.
• Yang and Lozano (2015) Yang, E. and Lozano, A. C. (2015), “Robust Gaussian Graphical Modeling with the Trimmed Graphical Lasso,” in Neur. Info. Proc. Sys. (NIPS), 28.
• Yang and Ravikumar (2013) Yang, E. and Ravikumar, P. (2013), “Dirty Statistical Models,” in Neur. Info. Proc. Sys. (NIPS), 26.
• Yang et al. (2012) Yang, E., Ravikumar, P., Allen, G. I., and Liu, Z. (2012), “Graphical Models via Generalized Linear Models,” in Neur. Info. Proc. Sys. (NIPS), 25.
• Yang et al. (2013) Yang, E., Tewari, A., and Ravikumar, P. (2013), “On Robust Estimation of High Dimensional Generalized Linear Models,” in

Inter. Joint Conf. on Artificial Intelligence

, 13.
• Yuan and Lin (2007) Yuan, M. and Lin, Y. (2007), “Model selection and estimation in the Gaussian graphical model,” Biometrika, 94, 19–35.

Appendix A Proof of Theorem 1

We use the shorthand for local optimal error vector: and where is an arbitrary local optimum of -estimator of (4). Our proof mainly uses the fact that is a local minimum of (4) satisfying

 ⟨∇θL(θ∗+˜Δ,w∗+˜Γ),˜θ−θ⟩≤−⟨∂λ∥θ∗+˜Δ∥1,˜θ−θ⟩for any feasible θ.

This inequality comes from the first order stationary condition (see Loh and Wainwright (2015) for details) in terms of only fixing at . In order to provide the complete proof of the theorem, we need to define the set of notations on the model space, perturbation space and corresponding projections following Negahban et al. (2012). The sparse LTS (3) is a typical example of (4), and such notations can be naturally defined based on the true support set . In this proof, we specifically focus on the case with for notational simplicity, but statements here can be seamlessly extendible for the general regularizer and the appropriately defined model/perturbation spaces.

If we take above, we have

 ≤ λ(∥θ∗∥1+∥˜ΔSc∥1−∥˜ΔSc∥1−∥˜θ∥1)=λ(∥θ∗+˜ΔSc∥1−∥˜ΔSc∥1−∥˜θ∥1) (ii)≤ λ(∥θ∗+˜ΔSc+˜ΔS∥1+∥˜ΔS∥1−∥˜ΔSc∥1−∥˜θ∥1)=λ(∥˜ΔS∥1−∥˜ΔSc∥1), (12)

where is true support set of , the inequalities and hold by respectively the convexity and the triangular inequality of norm.

Now, by the RSC condition in 1, we obtain

 κl∥˜Δ∥22−τ1(n,p)∥˜Δ∥21 ≤ ⟨∇θL(θ∗+˜Δ,w∗)−∇θL(θ∗,w∗),˜Δ⟩ =

which is equivalent with

 κl∥˜Δ∥22−τ1(n,p)∥˜Δ∥21+⟨∇θL(θ∗+˜Δ,w∗+˜Γ)−∇θL(θ∗+˜Δ,w∗),˜Δ⟩ ≤ ⟨