Robust Gaussian Graphical Modeling with the Trimmed Graphical Lasso

10/28/2015 · Eunho Yang, et al.

Gaussian Graphical Models (GGMs) are popular tools for studying network structures. However, many modern applications such as gene network discovery and social interactions analysis often involve high-dimensional noisy data with outliers or heavier tails than the Gaussian distribution. In this paper, we propose the Trimmed Graphical Lasso for robust estimation of sparse GGMs. Our method guards against outliers by an implicit trimming mechanism akin to the popular Least Trimmed Squares method used for linear regression. We provide a rigorous statistical analysis of our estimator in the high-dimensional setting. In contrast, existing approaches for robust sparse GGMs estimation lack statistical guarantees. Our theoretical results are complemented by experiments on simulated and real gene expression data which further demonstrate the value of our approach.


1 Introduction

Gaussian graphical models (GGMs) form a powerful class of statistical models for representing distributions over a set of variables [1]. These models employ undirected graphs to encode conditional independence assumptions among the variables, which is particularly convenient for exploring network structures. GGMs are widely used in a variety of domains, including computational biology [2], natural language processing [3], image processing [4, 5, 6], statistical physics [7], and spatial statistics [8]. In many modern applications, the number of variables can exceed the number of observations. For instance, the number of genes in microarray data is typically larger than the sample size. In such high-dimensional settings, sparsity constraints are particularly pertinent for estimating GGMs, as they encourage only a few parameters to be non-zero and induce graphs with few edges. The most widely used estimator among others (see e.g. [9]) minimizes the Gaussian negative log-likelihood regularized by the $\ell_1$ norm of the entries (or the off-diagonal entries) of the precision matrix (see [10, 11, 12]). This estimator enjoys strong statistical guarantees (see e.g. [13]). The corresponding optimization problem is a log-determinant program that can be solved with interior point methods [14] or by coordinate descent algorithms [11, 12]. Alternatively, neighborhood selection [15, 16] can be employed to estimate conditional independence relationships separately for each node in the graph via Lasso linear regression [17]. Under certain assumptions, the sparse GGM structure can still be recovered even in high-dimensional settings.
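For concreteness, here is a minimal sketch of the neighborhood-selection route (not code from any of the cited works): each variable is regressed on all others with the Lasso, and a node is linked to the predictors that receive non-zero coefficients. The function name, the fixed regularization value, and the OR-style symmetrization are our own illustrative choices.

```python
# Minimal sketch of neighborhood selection for GGM structure recovery:
# regress each variable on the remaining ones with the Lasso and connect
# node j to every predictor with a non-zero coefficient.
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, alpha=0.1):
    n, p = X.shape
    adjacency = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)          # all variables except j
        fit = Lasso(alpha=alpha).fit(X[:, others], X[:, j])
        adjacency[j, others] = fit.coef_ != 0
    return adjacency | adjacency.T                   # symmetrize (OR rule)
```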

The aforementioned approaches rest on a fundamental assumption: the multivariate normality of the observations. However, outliers and corruption are frequently encountered in high-dimensional data (see e.g. [18] for gene expression data). Contamination of a few observations can drastically affect the quality of model estimation. It is therefore imperative to devise procedures that can cope with observations deviating from the model assumption. Despite this fact, little attention has been paid to robust estimation of high-dimensional graphical models. Relevant work includes [19], which leverages multivariate t-distributions and the EM algorithm for robustified inference. They also propose an alternative t*-model which adds flexibility to the classical t-distribution but requires the use of Monte Carlo EM or variational approximation, as the likelihood function is not available explicitly. Another pertinent work is that of [20], which introduces a robustified likelihood function. A two-stage procedure is proposed for model estimation, where the graphical structure is first obtained via coordinate gradient descent and the concentration matrix coefficients are subsequently re-estimated using iterative proportional fitting so as to guarantee positive definiteness of the final estimate.

In this paper, we propose the Trimmed Graphical Lasso method for robust Gaussian graphical modeling in the sparse high-dimensional setting. Our approach is inspired by the classical Least Trimmed Squares method used for robust linear regression [21], in the sense that it disregards the observations that are judged less reliable. More specifically, the Trimmed Graphical Lasso seeks to minimize a weighted version of the negative log-likelihood, regularized by the $\ell_1$ penalty on the concentration matrix of the GGM, under some simple constraints on the weights. These weights implicitly induce the trimming of certain observations. Our key contributions can be summarized as follows.

  • We introduce the Trimmed Graphical Lasso formulation, along with two strategies for solving the objective. One involves solving a series of graphical lasso problems; the other is more efficient and leverages composite gradient descent in conjunction with partial optimization.

  • As our key theoretical contribution, we provide statistical guarantees on the consistency of our estimator. This is in stark contrast with prior work on robust sparse GGM estimation (e.g. [19, 20]), which, to the best of our knowledge, does not provide any statistical analysis.

  • Experimental results under various data corruption scenarios further demonstrate the value of our approach.

2 Problem Setup and Robust Gaussian Graphical Models

Notation. For matrices $A$ and $B$, $\langle\langle A, B \rangle\rangle$ denotes the trace inner product $\mathrm{tr}(A B^\top)$. For a matrix $U$ and parameter $a$, $\|U\|_a$ denotes the element-wise $\ell_a$ norm, and $\|U\|_{a,\text{off}}$ the element-wise $\ell_a$ norm restricted to off-diagonal entries. For example, $\|U\|_{1,\text{off}} = \sum_{j \neq k} |U_{jk}|$. Finally, we use $\|U\|_F$ and $\|U\|_2$ to denote the Frobenius and spectral norms, respectively.
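To make this notation concrete, the following numpy lines (with a matrix of our own choosing) spell out the trace inner product, the element-wise $\ell_1$ norm and its off-diagonal variant, and the Frobenius and spectral norms.

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 3.0]])
B = np.eye(2)

trace_inner = np.trace(A @ B.T)                      # <<A, B>> = tr(A B^T)
l1_all = np.abs(A).sum()                             # element-wise l1 norm
l1_off = l1_all - np.abs(np.diag(A)).sum()           # off-diagonal entries only
frobenius = np.linalg.norm(A, 'fro')                 # Frobenius norm
spectral = np.linalg.norm(A, 2)                      # spectral (operator) norm
```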

Setup. Let $X = (X_1, \ldots, X_p)$ be a zero-mean Gaussian random field parameterized by the concentration matrix $\Theta^* \succ 0$:

P(X; \Theta^*) = \exp\big\{ -\tfrac{1}{2} \langle\langle \Theta^*, X X^\top \rangle\rangle - A(\Theta^*) \big\},    (1)

where $A(\Theta^*) = -\tfrac{1}{2} \log\det \Theta^* + \tfrac{p}{2} \log(2\pi)$ is the log-partition function of the Gaussian random field. Here, the probability density function in (1) is that of the $p$-variate Gaussian distribution $\mathcal{N}(0, \Sigma^*)$, where $\Sigma^* = (\Theta^*)^{-1}$.

Given $n$ i.i.d. samples $x^{(1)}, \ldots, x^{(n)}$ from the high-dimensional Gaussian random field (1), the standard way to estimate the inverse covariance matrix is to solve the $\ell_1$-regularized maximum likelihood estimator (MLE), which can be written as the following regularized log-determinant program:

\min_{\Theta \in \mathcal{S}^{p \times p}_{++}} \; \langle\langle \Theta, \widehat{S} \rangle\rangle - \log\det \Theta + \lambda_n \|\Theta\|_{1,\text{off}},    (2)

where $\widehat{S} := \tfrac{1}{n} \sum_{i=1}^n x^{(i)} (x^{(i)})^\top$ is the sample covariance matrix, $\mathcal{S}^{p \times p}_{++}$ is the space of symmetric positive definite matrices, and $\lambda_n$ is a regularization parameter that encourages a sparse graph model structure.
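For reference, the program (2) can be handed to an off-the-shelf solver. The snippet below uses scikit-learn's graphical_lasso purely as an illustration (the papers cited above describe their own algorithms); the data and the regularization value are arbitrary.

```python
# Sparse precision-matrix estimation via an l1-regularized log-determinant
# program, using scikit-learn's graphical lasso solver on the sample covariance.
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))        # n = 200 samples, p = 10 variables
S = np.cov(X, rowvar=False)               # sample covariance matrix

covariance, precision = graphical_lasso(S, alpha=0.1)
print(np.count_nonzero(precision))        # sparsity of the estimated precision
```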

In this paper, we consider the case where the number of random variables $p$ may be substantially larger than the sample size $n$; however, the concentration matrix $\Theta^*$ of the underlying distribution is sparse:

(C-1) The number of non-zero off-diagonal entries of $\Theta^*$ is at most $k$, that is, $\big|\{(i,j) : i \neq j,\ \Theta^*_{ij} \neq 0\}\big| \leq k$.

Now, suppose that $n$ samples are drawn from this underlying distribution (1) with true parameter $\Theta^*$. We further allow some samples to be corrupted and not drawn from (1). Specifically, the set of sample indices $\{1, \ldots, n\}$ is separated into two disjoint subsets: if the $i$-th sample is in the set of “good” samples, which we name $\mathcal{G}$, then it is a genuine sample from (1) with the parameter $\Theta^*$. On the other hand, if the $i$-th sample is in the set of “bad” samples, $\mathcal{B}$, the sample is corrupted. The identities of $\mathcal{G}$ and $\mathcal{B}$ are hidden from us. However, we naturally assume that only a small number of samples are corrupted:

(C-2) Let $h$ be the number of good samples: $h := |\mathcal{G}|$, and hence $|\mathcal{B}| = n - h$. Then, we assume that the larger portion of the samples is genuine and uncorrupted, so that $h = \gamma n$ for some $\gamma > 1/2$. If we assume that 40% of the samples are corrupted, then $\gamma = 0.6$.

In later sections, we derive a robust estimator for corrupted samples of sparse Gaussian graphical models and provide statistical guarantees for this estimator under conditions (C-1) and (C-2).

2.1 Trimmed Graphical Lasso

We now propose a Trimmed Graphical Lasso for robust estimation of sparse GGMs:

\min_{\Theta,\, w} \; \frac{1}{h} \sum_{i=1}^n w_i \, \langle\langle \Theta, x^{(i)} (x^{(i)})^\top \rangle\rangle - \log\det \Theta + \lambda_n \|\Theta\|_{1,\text{off}}
\quad \text{s.t.} \quad \Theta \in \mathcal{S}^{p \times p}_{++},\ \|\Theta\|_1 \leq R,\ w \in [0,1]^n,\ \mathbf{1}^\top w = h,    (3)

where $\lambda_n$ is a regularization parameter that decides the sparsity of our estimate, and $h$ is another parameter, which decides the number of samples (or the sum of weights) used in training. Ideally $h$ is set to the number of uncorrupted samples in $\mathcal{G}$, but in practice we can tune this parameter by cross-validation. Here, the constraint $\|\Theta\|_1 \leq R$ is required to analyze this non-convex optimization problem, as discussed in [22]. For the other tuning parameter $R$, any positive real value is sufficient as long as $R \geq \|\Theta^*\|_1$. Finally, note that when $w$ is fixed as the all-ones vector (and $R$ is set to infinity), the optimization problem (3) simply reduces to the vanilla $\ell_1$-regularized MLE (2) for sparse GGMs, with no provision for outliers.

The optimization problem (3) is convex in $\Theta$ as well as in $w$; however, it is not jointly convex. Nevertheless, we will show later that any local optimum of (3) is guaranteed to be strongly consistent under some fairly mild conditions.
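The sketch below evaluates the trimmed objective for a given pair $(\Theta, w)$, written against our reconstruction of (3); the per-sample loss, the $1/h$ scaling, and all names are ours.

```python
# Evaluate the (reconstructed) trimmed graphical lasso objective at (Theta, w).
import numpy as np

def per_sample_loss(X, Theta):
    # <<Theta, x_i x_i^T>> = x_i^T Theta x_i for every sample i
    return np.einsum('ij,jk,ik->i', X, Theta, X)

def trimmed_objective(X, Theta, w, lam, h):
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    off_l1 = np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()
    return per_sample_loss(X, Theta) @ w / h - logdet + lam * off_l1
```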

  Initialize $\Theta^{(0)}$ (e.g. $\Theta^{(0)} = I$)
  repeat
     Compute $w^{(t)}$ given $\Theta^{(t)}$, by assigning a weight of one to the $h$ observations with lowest negative log-likelihood and a weight of zero to the remaining ones.
     Compute the gradient of the smooth part of the objective at $(\Theta^{(t)}, w^{(t)})$.
     Line search. Choose a step size $\eta_t$ (see Nesterov (2007) for a discussion of how the step size may be chosen), checking that the following update maintains positive definiteness. This can be verified via Cholesky factorization (as in [23]).
     Update. $\Theta^{(t+1)} = \mathcal{S}_{\eta_t \lambda_n}\big(\Theta^{(t)} - \eta_t \nabla f(\Theta^{(t)}, w^{(t)})\big)$, where $f$ denotes the smooth part of the objective and $\mathcal{S}_\tau$ is the soft-thresholding operator, $[\mathcal{S}_\tau(U)]_{jk} = \mathrm{sign}(U_{jk}) \max(|U_{jk}| - \tau, 0)$, applied only to the off-diagonal elements of the matrix.
     Compute $\log\det \Theta^{(t+1)}$ reusing the Cholesky factor.
  until stopping criterion is satisfied
Algorithm 1 Trimmed Graphical Lasso in (3)

Optimization

As we briefly discussed above, the problem (3) is not jointly convex but biconvex. One possible approach to solving the objective of (3) is thus to alternate between solving for $w$ with $\Theta$ fixed and solving for $\Theta$ with $w$ fixed. Given $\Theta$, solving for $w$ is straightforward and boils down to assigning a weight of one to the $h$ observations with lowest negative log-likelihood and a weight of zero to the remaining ones. Given $w$, solving for $\Theta$ can be accomplished by any algorithm solving the “vanilla” graphical Lasso program, e.g. [11, 12]. Each step solves a convex problem, hence the objective is guaranteed to decrease at each iteration and will converge to a local minimum.
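A minimal sketch of this alternating strategy follows, assuming the loss and scaling reconstructed above. The $\Theta$-step calls scikit-learn's graphical_lasso on the re-weighted empirical covariance, standing in for any vanilla graphical Lasso solver; this is our paraphrase of the procedure, not the authors' implementation.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def trimmed_glasso_alternating(X, h, lam, n_iter=20):
    n, p = X.shape
    Theta = np.eye(p)
    w = np.ones(n)
    for _ in range(n_iter):
        # w-step: keep the h samples with smallest negative log-likelihood;
        # the log-det term is common to all samples, so ranking x_i^T Theta x_i suffices
        loss = np.einsum('ij,jk,ik->i', X, Theta, X)
        w = np.zeros(n)
        w[np.argsort(loss)[:h]] = 1.0
        # Theta-step: vanilla graphical lasso on the weighted empirical covariance
        S_w = (X.T * w) @ X / h
        _, Theta = graphical_lasso(S_w, alpha=lam)
    return Theta, w
```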

A more efficient optimization approach can be obtained by adopting a partial minimization strategy for $w$: rather than solving for $w$ to completion each time $\Theta$ is updated, one performs a single-step update. This approach stems from considering the following equivalent reformulation of our objective:

\min_{\Theta \in \mathcal{S}^{p \times p}_{++},\ \|\Theta\|_1 \leq R} \; \Big\{ \min_{w \in [0,1]^n,\ \mathbf{1}^\top w = h} \frac{1}{h} \sum_{i=1}^n w_i \, \langle\langle \Theta, x^{(i)} (x^{(i)})^\top \rangle\rangle \Big\} - \log\det \Theta + \lambda_n \|\Theta\|_{1,\text{off}}.    (4)

One can then leverage standard first-order methods such as projected and composite gradient descent [24], which will converge to local optima. The overall procedure is depicted in Algorithm 1. Therein we assume that $R$ is picked sufficiently large, so that one does not need to enforce the constraint $\|\Theta\|_1 \leq R$ explicitly. If needed, the constraint can be enforced by an additional projection step [22].
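The sketch below shows a single composite gradient update in the spirit of Algorithm 1, under the same reconstructed loss: a gradient step on the smooth part, soft-thresholding of the off-diagonal entries, and a Cholesky check for positive definiteness. The simple halving rule stands in for the line search discussed above, and all names are illustrative.

```python
import numpy as np

def soft_threshold_offdiag(U, tau):
    T = np.sign(U) * np.maximum(np.abs(U) - tau, 0.0)
    np.fill_diagonal(T, np.diag(U))           # threshold off-diagonal entries only
    return T

def composite_gradient_step(X, Theta, w, lam, h, eta=1.0):
    S_w = (X.T * w) @ X / h                   # weighted empirical covariance
    grad = S_w - np.linalg.inv(Theta)         # gradient of the smooth part
    while True:
        Theta_new = soft_threshold_offdiag(Theta - eta * grad, eta * lam)
        try:
            np.linalg.cholesky(Theta_new)     # positive-definiteness check
            return Theta_new
        except np.linalg.LinAlgError:
            eta *= 0.5                        # backtrack until the update is PD
```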

3 Statistical Guarantees of Trimmed Graphical Lasso

One of the main contributions of this paper is to provide statistical guarantees for our Trimmed Graphical Lasso estimator for GGMs. The optimization problem (3) is non-convex, and therefore gradient-type methods for solving (3) will only find local minima. Hence, our theory in this section provides statistical error bounds for any local minimum, measured in the Frobenius and element-wise $\ell_1$ norms simultaneously.

Suppose that we have some local optimum $(\widetilde{\Theta}, \widetilde{w})$ of (3) obtained by an arbitrary gradient-based method. While $\widetilde{w}$ is fixed unconditionally, we define $\widehat{w}$ as follows: for a sample index $i \in \mathcal{G}$, $\widehat{w}_i$ is simply set to $\widetilde{w}_i$ so that $\widehat{w}_i - \widetilde{w}_i = 0$. Otherwise, for a sample index $i \in \mathcal{B}$, we set $\widehat{w}_i = 0$. Hence, $\widehat{w}$ is dependent on $\widetilde{w}$.

In order to derive the upper bound on the Frobenius norm error, we first need to assume the standard restricted strong convexity condition of (3) with respect to the parameter $\Theta$:

(C-3) (Restricted strong convexity condition) Let $\Delta$ be an arbitrary error of the parameter $\Theta$; that is, $\Delta = \Theta - \Theta^*$. Then, for any possible error $\Delta$ such that $\|\Delta\|_F \leq 1$,

    (5)

    where $\kappa_l > 0$ is a curvature parameter.

Note that in order to guarantee Frobenius norm-based error bounds, (C-3) is required even for vanilla Gaussian graphical models without outliers; it has been established in several previous works, for instance via the following lemma:

Lemma 1 (Section B.4 of [22]).

For any $\Delta$ such that $\|\Delta\|_F \leq 1$,

thus (C-3) holds with the corresponding curvature parameter $\kappa_l$.

While (C-3) is a standard condition that is also imposed for conventional estimators given a clean set of samples, we additionally require the following condition for successful estimation by (3) on corrupted samples:

(C-4) Consider an arbitrary local optimum $(\widetilde{\Theta}, \widetilde{w})$. Let $\Delta := \widetilde{\Theta} - \Theta^*$ and consider the weight error $\widetilde{w} - \widehat{w}$. Then,

    with some positive quantities, depending on the corrupted samples and on $\lambda_n$, that bound the cross term between the parameter error and the weight error. These will be specified below for some concrete examples.

Condition (C-4) can be understood as a structural incoherence condition between the model parameter $\Theta$ and the weight parameter $w$. Such a condition is usually imposed when analyzing estimators with multiple parameters (for example, see [25] for a robust linear regression estimator). Since $\widehat{w}$ is defined depending on $\widetilde{w}$, each local optimum has its own (C-4) condition. We will see in the sequel that under some reasonable cases, this condition holds for any local optimum with high probability. Also note that for the case with clean samples, condition (C-4) is trivially satisfied, since $\widetilde{w}_i = \widehat{w}_i$ for all $i$ and hence the left-hand side becomes 0.

Armed with these conditions, we now state our main theorem on the error bounds of our estimator (3):

Theorem 1.

Consider corrupted Gaussian graphical models. Let $(\widetilde{\Theta}, \widetilde{w})$ be any local optimum of the $M$-estimator (3). Suppose that $(\widetilde{\Theta}, \widetilde{w})$ satisfies the condition (C-4). Suppose also that the regularization parameter $\lambda_n$ in (3) is set such that

(6)

Then, this local optimum is guaranteed to be consistent as follows:

(7)

The statement in Theorem 1 holds deterministically; the probabilistic statement arises when we show that (C-4) and (6) are satisfied for a given $\lambda_n$. Note that choosing $\lambda_n$ in this way, based on an upper bound of the dual norm of the gradient of the loss, is standard (see [26] for details). It is also important to note that the additional term in the bound captures the relation between the element-wise $\ell_1$ norm and the error norm including diagonal entries. Due to the space limit, the proof of Theorem 1 (and all other proofs) is provided in the Supplements.

Now, it is natural to ask how easily the conditions in Theorem 1 can be satisfied. Intuitively, it is impossible to recover the true parameter by a weighting approach as in (3) when the amount of corruption exceeds that of the normal observation errors.

To this end, suppose that we have some upper bound on the corruptions:

(C-B1) For some function of the outliers, we have

where $X_{\mathcal{B}}$ denotes the sub-design matrix in $X$ corresponding to the outliers. Under this assumption, we can properly choose the regularization parameter $\lambda_n$ satisfying (6), as follows:

Corollary 1.

Consider corrupted Gaussian graphical models under conditions (C-3) and (C-B1). Suppose that we choose the regularization parameter $\lambda_n$ as

Then, any local optimum $(\widetilde{\Theta}, \widetilde{w})$ of (3) is guaranteed to satisfy (C-4) and to obey the error bounds in (7), with probability at least $1 - c_1 \exp(-c_2 \log p)$ for some universal positive constants $c_1$ and $c_2$.

If we further assume that the number of corrupted samples scales at most as follows:

(C-B2) The number of corrupted samples $n - h$ is bounded by a suitable rate in $n$ and $p$, up to some constant factor,

then we can derive the following result as another corollary of Theorem 1:

Corollary 2.

Consider corrupted Gaussian graphical models. Suppose that the conditions (C-3), (C-B1) and (C-B2) hold. Also suppose that the regularization parameter is set as $\lambda_n \propto \sqrt{\log p / n}$. Then, if the sample size $n$ is lower bounded as

then any local optimum $(\widetilde{\Theta}, \widetilde{w})$ of (3) is guaranteed to satisfy (C-4) and to have the following error bound:

(8)

with probability at least $1 - c_1 \exp(-c_2 \log p)$ for some universal positive constants $c_1$ and $c_2$.

Note that the element-wise $\ell_1$ norm based error bound can also be easily derived using this selection of $\lambda_n$. Corollary 2 reveals an interesting result: even when a number of the $n$ samples are corrupted, our estimator (3) can successfully recover the true parameter with the guaranteed error in (8). The first term in this bound exactly recovers the Frobenius error bound for the case without outliers (see [13, 22] for example). Due to the outliers, the performance degrades by the second term. To the best of our knowledge, these are the first statistical error bounds on parameter estimation for Gaussian graphical models with outliers. Also note that Corollary 1 only concerns a single local optimum obtained by an arbitrary optimization algorithm; to guarantee multiple local optima simultaneously, we may apply a union bound to the corollary.

When Outliers Follow a Gaussian Graphical Model

Now let us provide a concrete example and show how the quantity in (C-B1) is precisely specified in this case:

(C-B3) Outliers in the set $\mathcal{B}$ are drawn from another Gaussian graphical model of the form (1), with a parameter $\Theta_o \neq \Theta^*$.

This can be understood as a Gaussian mixture model where most of the samples are drawn from the distribution parameterized by $\Theta^*$, which we want to estimate, and a small portion of the samples are drawn from the distribution parameterized by $\Theta_o$. In this case, Corollary 2 can be further specialized as follows:

Corollary 3.

Suppose that the conditions (C-3), (C-B2) and (C-B3) hold. Then the statement in Corollary 2 holds, with the quantity in (C-B1) determined by the outlier parameter $\Theta_o$.

4 Experiments

Figure 1: Average ROC curves for the comparison methods for contamination scenarios M1–M4 (panels (a)–(d)).

In this section we evaluate the performance of our Trimmed Graphical Lasso (trim-glasso) algorithm on simulated data. We compare against glasso, the vanilla Graphical Lasso [11]; the t-lasso and t*-lasso methods [19]; and robust-LL, the robustified-likelihood approach of [20].

4.1 Simulated data

Our simulation setup is similar to that of [20] and is akin to gene regulatory networks. Namely, we consider four different scenarios where the outliers are generated from models with different graphical structures. Specifically, each sample is generated from a mixture distribution that combines the target Gaussian graphical model with an outlier distribution. Four different outlier distributions are considered:

  • M1: ,  M2: ,

  • M3: ,  M4: .

We also consider a scenario where the outliers are not symmetric about the mean, and simulate data from the following model:

  • M5:

For each simulation run, $\Theta^*$ is a randomly generated precision matrix corresponding to a network with hub nodes, simulated as follows. Let $A$ be the adjacency matrix of the network. For all $i < j$ we set $A_{ij} = 1$ with probability 0.03, and zero otherwise; we set $A_{ji} = A_{ij}$. We then randomly select 9 hub nodes and set the elements of the corresponding rows and columns of $A$ to one with probability 0.4 and zero otherwise. Using $A$, the simulated nonzero coefficients of the precision matrix are sampled as follows. First we create a matrix $E$ so that $E_{ij} = 0$ if $A_{ij} = 0$, and $E_{ij}$ is sampled uniformly from a fixed range if $A_{ij} = 1$. We then symmetrize $E$ and, finally, set $\Theta^* = E + \delta I$, where $\delta$ is chosen based on the smallest eigenvalue of $E$ so that $\Theta^*$ is positive definite. The outlier parameter $\Theta_o$ is a randomly generated precision matrix generated in the same way $\Theta^*$ is generated.
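A sketch of this generator is given below. The dimension p, the sampling range for the non-zero coefficients, and the diagonal shift are assumptions of ours; only the 0.03 edge probability, the 9 hub nodes, and the 0.4 hub probability come from the description above.

```python
import numpy as np

def simulate_hub_precision(p=100, n_hubs=9, edge_prob=0.03, hub_prob=0.4,
                           coef_range=(0.2, 0.5), eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    A = np.triu(rng.random((p, p)) < edge_prob, k=1)   # sparse background edges
    hubs = rng.choice(p, size=n_hubs, replace=False)
    for j in hubs:                                     # densify hub rows/columns
        A[j, :] = rng.random(p) < hub_prob
        A[:, j] = rng.random(p) < hub_prob
    A = np.triu(A | A.T, k=1)
    # assign signed magnitudes to the present edges, then symmetrize
    signs = rng.choice([-1.0, 1.0], size=(p, p))
    mags = rng.uniform(*coef_range, size=(p, p))
    E = np.where(A, signs * mags, 0.0)
    E = E + E.T
    # shift the diagonal so the smallest eigenvalue becomes strictly positive
    delta = np.abs(np.linalg.eigvalsh(E).min()) + eps
    return E + delta * np.eye(p)
```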

For the robustness parameter of the robust-LL method, we consider the values recommended in [20]. For the trim-glasso method we consider several values of $h$. Since all the robust comparison methods converge to a stationary point, we tested various initialization strategies for the concentration matrix, including simple diagonal initializations and the estimate from glasso. We did not observe any noticeable impact on the results.

Figure 1 presents the average ROC curves of the comparison methods over 100 simulated data sets for scenarios M1–M4, as the tuning parameter $\lambda$ varies. In the figure, for the robust-LL and trim-glasso methods, we depict the best curves with respect to their respective robustness parameters. Due to space constraints, the detailed results for all the parameter values considered, as well as the results for model M5, are provided in the Supplements.

From the ROC curves we can see that our proposed approach is competitive with the alternative robust approaches t-lasso, t*-lasso and robust-LL. The edge over glasso is even more pronounced for scenarios M2, M4 and M5. Surprisingly, trim-glasso achieves superior sensitivity for nearly any specificity.

Computationally, the trim-glasso method is also competitive with the alternatives. The average run-time over the path of tuning parameters is 45.78s for t-lasso, 22.14s for t*-lasso, 11.06s for robust-LL, 1.58s for trim-glasso, and 1.04s for glasso. Experiments were run in R on a single computing node with an Intel Core i5 2.5GHz CPU and 8GB memory. For t-lasso, t*-lasso and robust-LL we used the R implementations provided by the methods’ authors. For glasso we used the glassopath package.

4.2 Application to the analysis of Yeast Gene Expression Data

We analyze a yeast microarray dataset generated by [27]. The dataset comprises yeast segregants (instances), and we focus on the genes (variables) belonging to the cell-cycle pathway as provided by the KEGG database [28]. For each of these genes, we standardize the gene expression data to zero mean and unit standard deviation. We observed that the expression levels of some genes are clearly not symmetric about their means and might include outliers. For example, the histogram of gene ORC3 is presented in Figure 2(a).

Figure 2: (a) Histogram of standardized gene expression levels for gene ORC3. (b) Network estimated by trim-glasso

For the robust-LL and trim-glasso methods we fix their respective robustness parameters, and we use 5-fold cross-validation to choose the tuning parameter $\lambda$ for each method. After $\lambda$ is chosen for each method, we rerun the methods using the full dataset to obtain the final precision matrix estimates.
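For concreteness, such a cross-validation loop can look like the sketch below, which scores each candidate tuning value by held-out Gaussian negative log-likelihood; the scoring rule, the fold setup, and the fit_precision callback are illustrative choices of ours rather than the exact protocol used in the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def heldout_nll(Theta, X_val):
    # Gaussian negative log-likelihood (up to constants) of held-out data
    S_val = X_val.T @ X_val / X_val.shape[0]
    sign, logdet = np.linalg.slogdet(Theta)
    return np.trace(S_val @ Theta) - logdet

def choose_lambda(X, fit_precision, lambdas, n_splits=5, seed=0):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = np.zeros(len(lambdas))
    for train_idx, val_idx in kf.split(X):
        for k, lam in enumerate(lambdas):
            Theta = fit_precision(X[train_idx], lam)   # e.g. a trim-glasso fit
            scores[k] += heldout_nll(Theta, X[val_idx])
    return lambdas[int(np.argmin(scores))]
```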

Figure 2(b) shows the cell-cycle pathway estimated by our proposed method. For comparison the cell-cycle pathway from the KEGG [28] is provided in the Supplements. It is important to note that the KEGG graph corresponds to what is currently known about the pathway. It should not be treated as the ground truth. Certain discrepancies between KEGG and estimated graphs may also be caused by inherent limitations in the dataset used for modeling. For instance, some edges in cell-cycle pathway may not be observable from gene expression data. Additionally, the perturbation of cellular systems might not be strong enough to enable accurate inference of some of the links.

glasso tends to estimate more links than the robust methods. We postulate that the lack of robustness might result in inaccurate network reconstruction and the identification of spurious links. The robust methods tend to estimate networks that are more consistent with the KEGG pathway ($F_1$-scores of 0.23 for glasso, 0.37 for t*-lasso, 0.39 for robust-LL and 0.41 for trim-glasso, where the $F_1$-score is the harmonic mean of precision and recall). Our approach also recovers several characteristics of the KEGG pathway. For instance, genes CDC6 (a key regulator of DNA replication, playing important roles in the activation and maintenance of the checkpoint mechanisms coordinating S phase and mitosis) and PDS1 (an essential gene for meiotic progression and mitotic cell cycle arrest) are identified as hub genes, while genes CLB3, BRN1 and YCG1 are unconnected to any other genes.

References

  • Lauritzen [1996] S.L. Lauritzen. Graphical models. Oxford University Press, USA, 1996.
  • Oh and Deasy [2014] Jung Hun Oh and Joseph O. Deasy. Inference of radio-responsive gene regulatory networks using the graphical lasso algorithm. BMC Bioinformatics, 15(S-7):S5, 2014.
  • Manning and Schutze [1999] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
  • Woods [1978] J.W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846–850, October 1978.
  • Hassner and Sklansky [1978] M. Hassner and J. Sklansky. Markov random field models of digitized image texture. In ICPR78, pages 538–540, 1978.
  • Cross and Jain [1983] G. Cross and A. Jain. Markov random field texture models. IEEE Trans. PAMI, 5:25–39, 1983.
  • Ising [1925] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253–258, 1925.
  • Ripley [1981] B. D. Ripley. Spatial statistics. Wiley, New York, 1981.
  • Yang et al. [2014] E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for graphical models. In Neur. Info. Proc. Sys. (NIPS), 27, 2014.
  • Yuan and Lin [2007] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  • Friedman et al. [2007] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.
  • Bannerjee et al. [2008] O. Bannerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Jour. Mach. Lear. Res., 9:485–516, March 2008.
  • Ravikumar et al. [2011] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, UK, 2004.
  • Meinshausen and Bühlmann [2006] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
  • Yang et al. [2012] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models. In Neur. Info. Proc. Sys. (NIPS), 25, 2012.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
  • Daye et al. [2012] Z. J. Daye, J. Chen, and H. Li. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 68:316–326, 2012.
  • Finegold and Drton [2011] Michael Finegold and Mathias Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, 5(2A):1057–1080, 2011.
  • Sun and Li [2012] H. Sun and H. Li. Robust Gaussian graphical modeling via l1 penalization. Biometrics, 68:1197–206, 2012.
  • Alfons et al. [2013] A. Alfons, C. Croux, and S. Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat., 7:226–248, 2013.
  • Loh and Wainwright [2013] P-L Loh and M. J. Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Arxiv preprint arXiv:1305.2436v2, 2013.
  • Hsieh et al. [2011] C. J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Neur. Info. Proc. Sys. (NIPS), 24, 2011.
  • Nesterov [2007] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic Univ. Louvain (UCL)., 2007.
  • Nguyen and Tran [2013] N. H. Nguyen and T. D. Tran. Robust Lasso with missing and grossly corrupted observations. IEEE Trans. Info. Theory, 59(4):2036–2058, 2013.
  • Negahban et al. [2012] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
  • Brem and Kruglyak [2005] Rachel B Brem and Leonid Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America, 102(5):1572–1577, 2005.
  • Kanehisa et al. [2014] M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, and M. Tanabe. Data, information, knowledge and principle: back to metabolism in kegg. Nucleic Acids Res., 42:D199–D205, 2014.
  • Candes et al. [2006] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, August 2006.
  • Vershynin [2012] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012. Forthcoming.

Appendix

Appendix A Useful lemma(s)

Lemma 2 (Lemma 1 of [13]).

Suppose that $x^{(1)}, \ldots, x^{(n)}$ are i.i.d. samples from the Gaussian graphical model (1) with true parameter $\Theta^*$. Let $\mathcal{A}$ be the event that

where $\tau$ is any constant greater than 2. Then, the probability of the event $\mathcal{A}$ occurring is at least $1 - 1/p^{\tau - 2}$.

Appendix B Proof of Theorem 1

Consider the optimal error $\Delta := \widetilde{\Theta} - \Theta^*$ and the weight error $\widetilde{w} - \widehat{w}$, where $(\widetilde{\Theta}, \widetilde{w})$ is an arbitrary local optimum of the $M$-estimator (3). Our proof mainly uses the fact that $(\widetilde{\Theta}, \widetilde{w})$ is a local minimum of (3) satisfying

This inequality comes from the first-order stationarity condition (see [22] for details) in terms of $\Theta$ only, fixing $w$ at $\widetilde{w}$. Therefore, if we take $\Theta = \Theta^*$ above, we have

(9)

where $S$ denotes the true support set of $\Theta^*$, and the two inequalities hold by the convexity and the triangle inequality of the $\ell_1$ norm, respectively.

We first begin with the following lemma, which states that $\|\Delta\|_F \leq 1$ under conditions (C-3) and (C-4).

Lemma 3.

Suppose that conditions (C-3) and (C-4) hold. Moreover, suppose that $\lambda_n$ is chosen as mentioned in the statement of Theorem 1. Then, for any local optimum $(\widetilde{\Theta}, \widetilde{w})$, $\|\widetilde{\Theta} - \Theta^*\|_F \leq 1$, and the inequality (5) holds for all possible local optimal errors $\Delta$.

Proof.

Lemma 3 can be proved using the fact that the loss is a convex function of $\Theta$ for fixed weights. Hence, the function obtained by restricting the objective difference to the segment between $\Theta^*$ and $\widetilde{\Theta}$ is also convex in the interpolation parameter $t$, and is non-positive for $t \in [0, 1]$ (see [22] for details).

Now, suppose for contradiction that $\|\Delta\|_F > 1$. Then, we have

(10)

Since $\|\Delta\|_F > 1$, we can set $t = 1/\|\Delta\|_F$ so that $\|t\Delta\|_F = 1$. Hence, for the rescaled error $t\Delta$, the condition (C-3) holds by the assumption of the statement:

Combining with (B) yields

(11)

Now, from (B) and (11), followed by the condition (C-4) and Hölder’s inequality, we can obtain

By the choice of $\lambda_n$ in the assumption of the statement and the preceding facts, we can rearrange the above inequality into

which contradicts the assumption at the beginning of this proof. Hence, by contradiction, we can conclude that $\|\Delta\|_F \leq 1$ under the conditions in the statement. ∎

Now, by the RSC condition in (C-3) applied to $\Delta$, we obtain

By the condition (C-4),

(I):

and at the same time, by (B),

Finally, by the assumption on $\lambda_n$ and Hölder’s inequality,

Combining all pieces together yields

(12)

where the additional norm is defined as the absolute sum of the diagonal elements of a matrix. Therefore, (B) implies

At the same time, in order to derive the element-wise $\ell_1$ norm error bound, we again use the inequality in (B):

Hence,

which completes the proof.

Appendix C Proof of Corollary 1

In order to utilize Theorem 1, we simply need to specify the quantities in (C-4). Toward this, we follow a similar strategy as in [25]: given the error, we divide the index set of the corrupted samples $\mathcal{B}$ into disjoint, exhaustive subsets of a fixed size, such that the first subset contains the largest absolute elements, and so on. Then, we have