1 Introduction
Gaussian graphical models (GGMs) form a powerful class of statistical models for representing distributions over a set of variables [1]. These models employ undirected graphs to encode conditional independence assumptions among the variables, which is particularly convenient for exploring network structures. GGMs are widely used in a variety of domains, including computational biology [2], natural language processing [3], image processing [4, 5, 6], statistical physics [7], and spatial statistics [8]. In many modern applications, the number of variables can exceed the number of observations. For instance, the number of genes in microarray data is typically larger than the sample size. In such high-dimensional settings, sparsity constraints are particularly pertinent for estimating GGMs, as they encourage only a few parameters to be non-zero and induce graphs with few edges. The most widely used estimator among others (see e.g. [9]) minimizes the Gaussian negative log-likelihood regularized by the $\ell_1$ norm of the entries (or the off-diagonal entries) of the precision matrix (see [10, 11, 12]). This estimator enjoys strong statistical guarantees (see e.g. [13]). The corresponding optimization problem is a log-determinant program that can be solved with interior point methods [14] or by coordinate descent algorithms [11, 12]. Alternatively, neighborhood selection [15, 16] can be employed to estimate conditional independence relationships separately for each node in the graph via Lasso linear regression [17]. Under certain assumptions, the sparse GGM structure can still be recovered even in high-dimensional settings.

The aforementioned approaches rest on a fundamental assumption: the multivariate normality of the observations. However, outliers and corruption are frequently encountered in high-dimensional data (see e.g. [18] for gene expression data). Contamination of a few observations can drastically affect the quality of model estimation. It is therefore imperative to devise procedures that can cope with observations deviating from the model assumption. Despite this fact, little attention has been paid to robust estimation of high-dimensional graphical models. Relevant work includes [19], which leverages multivariate t-distributions for robustified inference together with the EM algorithm. The authors also propose an alternative t-model, which adds flexibility to the classical t-distribution but requires Monte Carlo EM or a variational approximation, as the likelihood function is not available explicitly. Another pertinent work is that of [20], which introduces a robustified likelihood function. A two-stage procedure is proposed for model estimation: the graphical structure is first obtained via coordinate gradient descent, and the concentration matrix coefficients are subsequently re-estimated using iterative proportional fitting so as to guarantee positive definiteness of the final estimate.

In this paper, we propose the Trimmed Graphical Lasso method for robust Gaussian graphical modeling in the sparse high-dimensional setting. Our approach is inspired by the classical Least Trimmed Squares method used for robust linear regression [21], in the sense that it disregards the observations that are judged least reliable. More specifically, the Trimmed Graphical Lasso seeks to minimize a weighted version of the negative log-likelihood, regularized by the $\ell_1$ penalty on the concentration matrix, under some simple constraints on the weights. These weights implicitly induce the trimming of certain observations. Our key contributions can be summarized as follows.

We introduce the Trimmed Graphical Lasso formulation, along with two strategies for solving the objective. One involves solving a series of graphical lasso problems; the other is more efficient and leverages composite gradient descent in conjunction with partial optimization.

We establish statistical guarantees showing that, under mild conditions, any local optimum of our non-convex objective enjoys consistency and explicit error bounds.

Experimental results under various data corruption scenarios further demonstrate the value of our approach.
2 Problem Setup and Robust Gaussian Graphical Models
Notation. For matrices $A$ and $B$, $\langle\langle A, B \rangle\rangle$ denotes the trace inner product $\mathrm{tr}(A B^{\top})$. For a matrix $U = (u_{ij})$ and a parameter $a \in [1, \infty]$, $\|U\|_a$ denotes the element-wise $\ell_a$ norm, and $\|U\|_{a,\text{off}}$ the element-wise $\ell_a$ norm restricted to off-diagonal entries. For example, $\|U\|_{1,\text{off}} = \sum_{i \neq j} |u_{ij}|$. Finally, we use $\|U\|_F$ and $\|U\|_2$ to denote the Frobenius and spectral norms, respectively.
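The two element-wise norms above are simple to compute; as a concrete illustration (the helper names here are ours, not the paper's), in Python:

```python
import numpy as np

# Hypothetical helper names; they implement the element-wise l1 norm
# ||U||_1 and its off-diagonal restriction ||U||_{1,off} described above.
def l1_elementwise(U):
    """Sum of absolute values of all entries of U."""
    return np.abs(U).sum()

def l1_offdiag(U):
    """Sum of absolute values of the off-diagonal entries of U."""
    return l1_elementwise(U) - np.abs(np.diag(U)).sum()

theta = np.array([[2.0, -1.0],
                  [-1.0, 3.0]])
# l1_elementwise(theta) -> 7.0, l1_offdiag(theta) -> 2.0
```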
Setup. Let $X = (X_1, \ldots, X_p)$ be a zero-mean Gaussian random field parameterized by the concentration matrix $\Theta \in \mathbb{R}^{p \times p}$:

(1) $\mathbb{P}(X; \Theta) = \exp\big\{ -\tfrac{1}{2} X^{\top} \Theta X - A(\Theta) \big\}$,

where $A(\Theta) = \tfrac{p}{2}\log(2\pi) - \tfrac{1}{2}\log\det(\Theta)$ is the log-partition function of the Gaussian random field. The probability density function in (1) is that of a $p$-variate Gaussian distribution $\mathcal{N}(0, \Sigma)$ with $\Sigma = \Theta^{-1}$. Given $n$ i.i.d. samples $x_1, \ldots, x_n$ from the high-dimensional Gaussian random field (1), the standard way to estimate the inverse covariance matrix is the $\ell_1$-regularized maximum likelihood estimator (MLE), which can be written as the following regularized log-determinant program:

(2) $\min_{\Theta \in \mathcal{S}_{++}^{p}} \; \langle\langle \Theta, S \rangle\rangle - \log\det(\Theta) + \lambda_n \|\Theta\|_{1,\text{off}}$,

where $\mathcal{S}_{++}^{p}$ is the space of $p \times p$ symmetric positive definite matrices, $S := \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top}$ is the sample covariance matrix, and $\lambda_n$ is a regularization parameter that encourages a sparse graph structure.
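For intuition on how the program (2) can be solved, the following is a minimal composite (proximal) gradient sketch in Python. It is not the coordinate-descent solver of [11, 12]; the step size, iteration count, and the eigenvalue floor that keeps iterates positive definite are illustrative choices.

```python
import numpy as np

def graphical_lasso_ista(S, lam, step=0.1, n_iter=500):
    """Minimize <<Theta, S>> - log det(Theta) + lam * ||Theta||_{1,off}
    by composite (proximal) gradient descent. A minimal sketch: step size,
    iteration count, and the eigenvalue floor are illustrative choices."""
    p = S.shape[0]
    theta = np.eye(p)
    for _ in range(n_iter):
        grad = S - np.linalg.inv(theta)        # gradient of the smooth part
        cand = theta - step * grad
        # prox of lam*||.||_{1,off}: soft-threshold off-diagonal entries
        prox = np.sign(cand) * np.maximum(np.abs(cand) - step * lam, 0.0)
        np.fill_diagonal(prox, np.diag(cand))
        # clip eigenvalues to keep the iterate positive definite
        w, V = np.linalg.eigh((prox + prox.T) / 2)
        theta = (V * np.maximum(w, 1e-3)) @ V.T
    return theta
```

For production use, the coordinate-descent solvers of [11, 12] (e.g. the glassopath R package used in Section 4) are both faster and more accurate.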
In this paper, we consider the case where the number of random variables $p$ may be substantially larger than the sample size $n$; however, the concentration parameter $\Theta^*$ of the underlying distribution is assumed to be sparse:

(C-1) The number of non-zero off-diagonal entries of $\Theta^*$ is at most $k$.
Now, suppose that $n$ samples are drawn from this underlying distribution (1) with true parameter $\Theta^*$. We further allow some samples to be corrupted and not drawn from (1). Specifically, the set of sample indices is separated into two disjoint subsets: if the $i$th sample is in the set of “good” samples, which we name $G$, then it is a genuine sample from (1) with parameter $\Theta^*$. On the other hand, if the $i$th sample is in the set of “bad” samples $B$, the sample is corrupted. The identities of $G$ and $B$ are hidden from us. However, we naturally assume that only a small number of samples are corrupted:

(C-2) Let $h$ be the number of good samples, $h := |G|$, and hence $|B| = n - h$. We assume that the larger portion of the samples is genuine and uncorrupted, so that $h \geq \alpha n$ for some $\alpha > 1/2$. If we assume that 40% of the samples are corrupted, then $\alpha = 0.6$.
In later sections, we derive a robust estimator for corrupted samples of sparse Gaussian graphical models and provide statistical guarantees for our estimator under conditions (C-1) and (C-2).
2.1 Trimmed Graphical Lasso
We now propose the Trimmed Graphical Lasso for robust estimation of sparse GGMs:

(3) $\displaystyle \min_{\Theta \in \mathcal{S}_{++}^{p},\, w} \; \frac{1}{h} \sum_{i=1}^{n} w_i \, \langle\langle \Theta, x_i x_i^{\top} \rangle\rangle - \log\det(\Theta) + \lambda_n \|\Theta\|_{1,\text{off}}$, subject to $w \in [0,1]^{n}$, $\mathbf{1}^{\top} w = h$, and a side constraint bounding a norm of $\Theta$ by a radius $R$,

where $\lambda_n$ is a regularization parameter deciding the sparsity of our estimate, and $h$ is another parameter, which decides the number of samples (or the sum of the weights) used in training. Ideally $h$ is set to the number of uncorrupted samples, but in practice we can tune the parameter by cross-validation. The side constraint on $\Theta$ is required to analyze this non-convex optimization problem, as discussed in [22]. For the additional tuning parameter $R$, any positive real value is sufficient as long as $R$ is at least the corresponding norm of the true parameter $\Theta^*$. Finally, note that when $w$ is fixed to the all-ones vector (and $R$ is set to infinity), the optimization problem (3) simply reduces to the vanilla $\ell_1$-regularized MLE (2) for sparse GGMs, which does not account for outliers.

The optimization problem (3) is convex in $\Theta$ for fixed $w$ and convex in $w$ for fixed $\Theta$; however, it is not jointly convex. Nevertheless, we will show later that any local optimum of (3) is guaranteed to be strongly consistent under some fairly mild conditions.
Optimization
As we briefly discussed above, problem (3) is not jointly convex but is biconvex. One possible approach to solving (3) is thus to alternate between solving for $w$ with $\Theta$ fixed and solving for $\Theta$ with $w$ fixed. Given $\Theta$, solving for $w$ is straightforward and boils down to assigning a weight of one to the $h$ observations with the lowest negative log-likelihood, and a weight of zero to the remaining ones. Given $w$, solving for $\Theta$ can be accomplished by any algorithm solving the “vanilla” graphical Lasso program, e.g. [11, 12]. Each step solves a convex problem, hence the objective is guaranteed to decrease at each iteration and converges to a local minimum.
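The $w$-step above can be sketched as follows (the function name and the use of the quadratic form as the ranking statistic are ours; since $-\log\det(\Theta)$ is common to every sample, ranking by $x_i^{\top}\Theta x_i$ is equivalent to ranking by the negative log-likelihood):

```python
import numpy as np

def trimming_weights(X, theta, h):
    """Given a fixed precision matrix theta, the optimal 0/1 weights keep
    the h samples with the smallest Gaussian negative log-likelihood.
    Since -log det(theta) is common to every sample, ranking by the
    quadratic form x_i^T theta x_i is equivalent."""
    nll = np.einsum('ij,jk,ik->i', X, theta, X)  # x_i^T theta x_i per row
    w = np.zeros(len(X))
    w[np.argsort(nll)[:h]] = 1.0
    return w
```

The weighted covariance $\frac{1}{h}\sum_i w_i x_i x_i^{\top}$ built from these weights is then handed to any graphical Lasso solver for the $\Theta$-step.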
A more efficient optimization approach can be obtained by adopting a partial minimization strategy for $\Theta$: rather than solving for $\Theta$ to completion each time $w$ is updated, one performs a single-step update. This approach stems from considering the following equivalent reformulation of our objective:
(4) $\displaystyle \min_{\Theta \in \mathcal{S}_{++}^{p}} \; \Big\{ \min_{w \in \mathcal{W}} \; \frac{1}{h} \sum_{i=1}^{n} w_i \, \langle\langle \Theta, x_i x_i^{\top} \rangle\rangle \Big\} - \log\det(\Theta) + \lambda_n \|\Theta\|_{1,\text{off}}, \quad \text{where } \mathcal{W} := \{ w \in [0,1]^{n} : \mathbf{1}^{\top} w = h \}.$
One can then leverage standard first-order methods such as projected and composite gradient descent [24], which converge to local optima. The overall procedure is depicted in Algorithm 1. Therein we assume that $R$ is picked sufficiently large, so one does not need to enforce the side constraint on $\Theta$ explicitly. If needed, the constraint can be enforced by an additional projection step [22].
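A minimal sketch of this single-step scheme, assuming 0/1 weights and a soft-thresholding prox for the off-diagonal $\ell_1$ penalty. The step size, iteration count, and the eigenvalue floor used to keep iterates positive definite are illustrative choices, not Algorithm 1's exact specification:

```python
import numpy as np

def trimmed_glasso(X, h, lam, step=0.05, n_iter=300):
    """Each iteration fully minimizes over the weights (keep the h samples
    with smallest quadratic form), then takes a single composite gradient
    step in Theta. A sketch; tuning constants are illustrative."""
    n, p = X.shape
    theta = np.eye(p)
    for _ in range(n_iter):
        # w-step: keep the h samples with smallest x_i^T theta x_i
        nll = np.einsum('ij,jk,ik->i', X, theta, X)
        keep = np.argsort(nll)[:h]
        S_w = X[keep].T @ X[keep] / h          # trimmed sample covariance
        # Theta-step: one composite gradient step
        grad = S_w - np.linalg.inv(theta)
        cand = theta - step * grad
        # prox of lam*||.||_{1,off}: soft-threshold off-diagonal entries
        prox = np.sign(cand) * np.maximum(np.abs(cand) - step * lam, 0.0)
        np.fill_diagonal(prox, np.diag(cand))
        # keep the iterate positive definite via eigenvalue clipping
        w_eig, V = np.linalg.eigh((prox + prox.T) / 2)
        theta = (V * np.maximum(w_eig, 1e-3)) @ V.T
    return theta, keep
```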
3 Statistical Guarantees of Trimmed Graphical Lasso
One of the main contributions of this paper is to provide statistical guarantees for our Trimmed Graphical Lasso estimator. The optimization problem (3) is non-convex, and gradient-type methods solving (3) therefore return local minima. Hence, our theory in this section provides statistical error bounds that hold for any local minimum, measured in two norms simultaneously.
Suppose that we have some local optimum of (3) by arbitrary gradientbased method. While is fixed unconditionally, we define as follows: for a sample index , is simply set to so that . Otherwise for a sample index , we set . Hence, is dependent on .
In order to derive an upper bound on the Frobenius-norm error, we first need to assume a standard restricted strong convexity condition for (3) with respect to the parameter $\Theta$:

(C-3) (Restricted strong convexity) Let $\Delta := \Theta - \Theta^*$ be an arbitrary error on the parameter $\Theta$. Then, for any possible error $\Delta$ within the stated radius,

(5) the first-order Taylor remainder of the loss at $\Theta^*$ in the direction $\Delta$ is lower bounded by $\kappa_l \|\Delta\|_F^2$, where $\kappa_l$ is a curvature parameter.
Note that in order to guarantee Frobenius-norm-based error bounds, condition (C-3) is required even for vanilla Gaussian graphical models without outliers; it has been well studied in several works, as in the following lemma:
Lemma 1 (Section B.4 of [22]).
For any $\Delta$ satisfying the radius condition stated therein, condition (C-3) holds with an explicit curvature parameter $\kappa_l$.
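As a quick numerical sanity check of the convexity underlying condition (C-3), the first-order Taylor remainder of the negative log-determinant loss is non-negative and grows quadratically in the perturbation (a sketch with illustrative values; the constant here is not the paper's $\kappa_l$):

```python
import numpy as np

def taylor_remainder(theta_star, delta):
    """First-order Taylor remainder of L(Theta) = -log det(Theta)
    at theta_star in direction delta."""
    neg_logdet = lambda T: -np.linalg.slogdet(T)[1]
    grad = -np.linalg.inv(theta_star)  # gradient of -log det at theta_star
    return (neg_logdet(theta_star + delta) - neg_logdet(theta_star)
            - np.sum(grad * delta))

rng = np.random.default_rng(0)
p = 4
theta_star = 2.0 * np.eye(p)
D = 0.05 * rng.normal(size=(p, p))
delta = (D + D.T) / 2  # small random symmetric perturbation
r = taylor_remainder(theta_star, delta)
# r is non-negative and of order ||delta||_F^2 (roughly ||delta||_F^2 / 8 here)
```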
While (C-3) is a standard condition that is also imposed for conventional estimators on a clean set of samples, we additionally require the following condition for successful estimation via (3) on corrupted samples:

(C-4) (Structural incoherence) Consider an arbitrary local optimum $(\widetilde{\Theta}, \widetilde{w})$ of (3). Then the cross term involving the parameter error and the weight error is bounded by positive quantities depending on $n$ and $p$; these quantities will be specified below for some concrete examples.
Condition (C-4) can be understood as a structural incoherence condition between the model parameter $\Theta$ and the weight parameter $w$. Such a condition is usually imposed when analyzing estimators with multiple parameters (see, for example, [25] for a robust linear regression estimator). Since the condition is stated at a local optimum, each local optimum has its own version of (C-4). We will see in the sequel that, in some reasonable cases, this condition holds for any local optimum with high probability. Also note that in the case of clean samples, condition (C-4) is trivially satisfied, since the weight error is zero for all samples and hence the left-hand side becomes 0.
Armed with these conditions, we now state our main theorem on the error bounds of our estimator (3):
Theorem 1.
The statement in Theorem 1 holds deterministically; the probabilistic statement arises when showing that condition (C-4) and the choice of regularization parameter in (6) are satisfied for a given sample. Note that this way of choosing $\lambda_n$ is standard (see [26] for details). It is also important to note that the theorem involves a term capturing the relation between the element-wise norm and the error norm including diagonal entries. Due to space constraints, the proof of Theorem 1 (and all other proofs) is provided in the Supplements.
Now, it is natural to ask how easily the conditions in Theorem 1 can be satisfied. Intuitively, it is impossible to recover the true parameter by a weighting approach as in (3) when the amount of corruption exceeds that of normal observation errors.
To this end, suppose that we have some upper bound on the corruptions:

(C-B1) For some function of the problem dimensions, the contribution of the corrupted samples is bounded, where $X_B$ denotes the sub-design matrix of $X$ corresponding to the outliers.

Under this assumption, we can properly choose a regularization parameter satisfying (6), as follows:
Corollary 1.
If we further assume that the number of corrupted samples scales at most as follows:

(C-B2) The number of corrupted samples $n - h$ is at most a constant multiple of the stated rate,

then we can derive the following result as another corollary of Theorem 1:
Corollary 2.
Consider corrupted Gaussian graphical models. Suppose that the sparsity condition (C-1), the restricted strong convexity condition (C-3), and the corruption bound (C-B1) hold. Also suppose that the regularization parameter $\lambda_n$ is set as in Corollary 1. Then, if the sample size is lower bounded as stated, any local optimum of (3) is guaranteed to satisfy condition (C-4) and to obey the following error bound:
(8) 
with probability at least for some universal positive constants and .
Note that the corresponding norm-based error bound can also be easily derived using the stated selection of $\lambda_n$. Corollary 2 reveals an interesting result: even when a constant fraction of the $n$ samples is corrupted, our estimator (3) can successfully recover the true parameter with the error bound guaranteed in (8). The first term in this bound exactly recovers the Frobenius error bound for the case without outliers (see [13, 22] for example). Due to the outliers, performance degrades by the second term. To the best of our knowledge, these are the first statistical error bounds on parameter estimation for Gaussian graphical models with outliers. Also note that Corollary 1 only concerns a single local optimum derived by an arbitrary optimization algorithm; for simultaneous guarantees over multiple local optima, we may use a union bound over the corollary.
When Outliers Follow a Gaussian Graphical Model
Now let us provide a concrete example and show how the corruption bound in (C-B1) is precisely specified in this case:

(C-B3) Outliers in the set $B$ are drawn from another Gaussian graphical model of the form (1), with a parameter different from $\Theta^*$.

This can be understood as a Gaussian mixture model where most of the samples are drawn from the distribution parameterized by $\Theta^*$, which we want to estimate, and a small portion of the samples are drawn from the outlier distribution. In this case, Corollary 2 can be further sharpened as follows:

4 Experiments
In this section we evaluate the performance of our Trimmed Graphical Lasso (trimglasso) algorithm on simulated data. We compare against glasso, the vanilla Graphical Lasso [11]; the tlasso and t*lasso methods [19]; and robustLL, the robustified-likelihood approach of [20].
4.1 Simulated data
Our simulation setup is similar to [20] and is akin to gene regulatory networks. Namely, we consider four different scenarios where the outliers are generated from models with different graphical structures. Specifically, each sample is generated from a two-component mixture distribution in which the majority component is the Gaussian graphical model of interest and the minority component is an outlier distribution. Four different outlier distributions are considered:

M1: , M2: ,

M3: , M4: .
We also consider a scenario where the outliers are not symmetric about the mean, and simulate data from the following model:

M5:
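A sketch of this kind of contaminated sampling (the contamination level and parameter names are illustrative, not the exact mixture weights of M1–M5):

```python
import numpy as np

def sample_contaminated(n, Sigma, Sigma_out, mu_out, p_out=0.1, seed=0):
    """Each observation is drawn from the zero-mean 'good' Gaussian with
    covariance Sigma with probability 1 - p_out, and from an outlier
    Gaussian with mean mu_out and covariance Sigma_out otherwise.
    p_out and the parameter names are illustrative choices."""
    rng = np.random.default_rng(seed)
    p = len(Sigma)
    is_out = rng.random(n) < p_out
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X[is_out] = rng.multivariate_normal(mu_out, Sigma_out,
                                        size=int(is_out.sum()))
    return X, is_out
```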
For each simulation run, $\Theta^*$ is a randomly generated precision matrix corresponding to a network with hub nodes, simulated as follows. Let $A$ be the adjacency matrix of the network. For each pair of nodes, we set the corresponding entry of $A$ to one with probability 0.03, and to zero otherwise. We then randomly select 9 hub nodes and set the elements of the corresponding rows and columns of $A$ to one with probability 0.4 and to zero otherwise. Using $A$, the simulated non-zero coefficients of the precision matrix are sampled as follows. First we create a matrix $E$ whose entries are zero wherever $A$ is zero, and are sampled uniformly from the specified interval wherever $A$ is one. We then symmetrize $E$ and shift its diagonal by an amount determined by its smallest eigenvalue, so that the resulting matrix $\Theta^*$ is positive definite. The outlier precision matrix is randomly generated in the same way as $\Theta^*$.

For the robustness parameter of the robustLL method, we consider the values recommended in [20]. For the trimglasso method we consider several values of $h$. Since all the robust comparison methods converge to a stationary point, we tested various initialization strategies for the concentration matrix, including simple fixed initializations and the estimate from glasso; we did not observe any noticeable impact on the results.
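The hub-network simulation described above can be sketched as follows (the magnitude range, the random sign flips, and the diagonal shift are illustrative choices where the text leaves them unspecified):

```python
import numpy as np

def simulate_hub_precision(p=100, n_hubs=9, p_edge=0.03, p_hub=0.4,
                           low=0.25, high=0.75, eps=0.1, seed=0):
    """Sketch of the hub-network simulation. The magnitude range
    [low, high], sign flips, and diagonal shift eps are illustrative."""
    rng = np.random.default_rng(seed)
    # background adjacency: each pair connected with probability p_edge
    A = np.triu((rng.random((p, p)) < p_edge).astype(float), 1)
    A = A + A.T
    # promote n_hubs randomly chosen nodes to hubs (dense rows/columns)
    for j in rng.choice(p, n_hubs, replace=False):
        mask = rng.random(p) < p_hub
        A[j, mask] = 1.0
        A[mask, j] = 1.0
    np.fill_diagonal(A, 0.0)
    # sample edge magnitudes and signs on the upper triangle, symmetrize
    E = (np.triu(A, 1) * rng.uniform(low, high, (p, p))
         * rng.choice([-1.0, 1.0], (p, p)))
    E = E + E.T
    # shift the diagonal so the resulting matrix is positive definite
    shift = abs(min(np.linalg.eigvalsh(E).min(), 0.0)) + eps
    return E + shift * np.eye(p)
```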
Figure 1 presents the average ROC curves of the comparison methods over 100 simulated data sets for scenarios M1–M4 as the tuning parameter $\lambda$ varies. In the figure, for the robustLL and trimglasso methods, we depict the best curves with respect to their respective robustness parameters. Due to space constraints, the detailed results for all the parameter values considered, as well as the results for model M5, are provided in the Supplements.
From the ROC curves we can see that our proposed approach is competitive with the alternative robust approaches tlasso, t*lasso and robustLL. The edge over glasso is even more pronounced for scenarios M2, M4 and M5. Surprisingly, trimglasso achieves superior sensitivity for nearly any specificity.
Computationally, the trimglasso method is also competitive with the alternatives. The average runtime over the path of tuning parameters is 45.78s for tlasso, 22.14s for t*lasso, 11.06s for robustLL, 1.58s for trimglasso, and 1.04s for glasso. Experiments were run in R on a single computing node with an Intel Core i5 2.5GHz CPU and 8GB of memory. For tlasso, t*lasso and robustLL we used the R implementations provided by the methods’ authors. For glasso we used the glassopath package.
4.2 Application to the analysis of Yeast Gene Expression Data
We analyze a yeast microarray dataset generated by [27]. The dataset concerns yeast segregants (instances). We focus on genes (variables) belonging to the cell-cycle pathway as provided by the KEGG database [28]. For each of these genes, we standardize the gene expression data to zero mean and unit standard deviation. We observe that the expression levels of some genes are clearly not symmetric about their means and might include outliers; for example, the histogram of gene ORC3 is presented in Figure 2(a).

For the robustLL method we set the robustness parameter as in the simulations, and for trimglasso we consider several values of $h$. We use 5-fold cross-validation to choose the tuning parameter $\lambda$ for each method. After $\lambda$ is chosen for each method, we rerun the methods using the full dataset to obtain the final precision matrix estimates.
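The cross-validation loop can be sketched as follows, with the actual estimator (trimglasso in our experiments) abstracted behind a `fit` callback. The default stand-in below is a simple shrinkage inverse, used only to keep the sketch self-contained and runnable:

```python
import numpy as np

def heldout_nll(theta, X_val):
    """Average Gaussian negative log-likelihood (up to constants) of
    held-out samples under precision matrix theta -- the CV score."""
    S_val = X_val.T @ X_val / len(X_val)
    return np.trace(S_val @ theta) - np.linalg.slogdet(theta)[1]

def cv_select_lambda(X, lambdas, n_folds=5, fit=None, seed=0):
    """K-fold CV over a grid of regularization parameters. `fit` is the
    estimator; the default stand-in is a shrinkage inverse, used only to
    make the sketch self-contained."""
    if fit is None:
        fit = lambda X_tr, lam: np.linalg.inv(
            X_tr.T @ X_tr / len(X_tr) + lam * np.eye(X_tr.shape[1]))
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for lam in lambdas:
        s = 0.0
        for k in range(n_folds):
            tr = np.concatenate([f for j, f in enumerate(folds) if j != k])
            s += heldout_nll(fit(X[tr], lam), X[folds[k]])
        scores.append(s / n_folds)
    return lambdas[int(np.argmin(scores))]
```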
Figure 2(b) shows the cell-cycle pathway estimated by our proposed method. For comparison, the cell-cycle pathway from KEGG [28] is provided in the Supplements. It is important to note that the KEGG graph corresponds to what is currently known about the pathway; it should not be treated as ground truth. Certain discrepancies between the KEGG and estimated graphs may also be caused by inherent limitations in the dataset used for modeling. For instance, some edges in the cell-cycle pathway may not be observable from gene expression data. Additionally, the perturbation of cellular systems might not be strong enough to enable accurate inference of some of the links.
glasso tends to estimate more links than the robust methods. We postulate that this lack of robustness might result in inaccurate network reconstruction and the identification of spurious links. The robust methods tend to estimate networks that are more consistent with the KEGG pathway ($F_1$ score of 0.23 for glasso, 0.37 for t*lasso, 0.39 for robustLL and 0.41 for trimglasso, where the $F_1$ score is the harmonic mean of precision and recall). Our approach also recovers several characteristics of the KEGG pathway. For instance, genes CDC6 (a key regulator of DNA replication, playing important roles in the activation and maintenance of the checkpoint mechanisms coordinating S phase and mitosis) and PDS1 (an essential gene for meiotic progression and mitotic cell cycle arrest) are identified as hub genes, while genes CLB3, BRN1 and YCG1 are unconnected to any other genes.

References
 Lauritzen [1996] S.L. Lauritzen. Graphical models. Oxford University Press, USA, 1996.
 Oh and Deasy [2014] Jung Hun Oh and Joseph O. Deasy. Inference of radioresponsive gene regulatory networks using the graphical lasso algorithm. BMC Bioinformatics, 15(S7):S5, 2014.
 Manning and Schutze [1999] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
 Woods [1978] J.W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846–850, October 1978.
 Hassner and Sklansky [1978] M. Hassner and J. Sklansky. Markov random field models of digitized image texture. In ICPR78, pages 538–540, 1978.
 Cross and Jain [1983] G. Cross and A. Jain. Markov random field texture models. IEEE Trans. PAMI, 5:25–39, 1983.
 Ising [1925] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253–258, 1925.
 Ripley [1981] B. D. Ripley. Spatial statistics. Wiley, New York, 1981.
 Yang et al. [2014] E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for graphical models. In Neur. Info. Proc. Sys. (NIPS), 27, 2014.
 Yuan and Lin [2007] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
 Friedman et al. [2007] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.
 Banerjee et al. [2008] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Jour. Mach. Lear. Res., 9:485–516, March 2008.
 Ravikumar et al. [2011] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
 Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, UK, 2004.
 Meinshausen and Bühlmann [2006] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
 Yang et al. [2012] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models. In Neur. Info. Proc. Sys. (NIPS), 25, 2012.
 Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

 Daye et al. [2012] Z.J. Daye, J. Chen, and H. Li. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 68:316–326, 2012.
 Finegold and Drton [2011] Michael Finegold and Mathias Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, 5(2A):1057–1080, 2011.
 Sun and Li [2012] H. Sun and H. Li. Robust Gaussian graphical modeling via $\ell_1$ penalization. Biometrics, 68:1197–1206, 2012.
 Alfons et al. [2013] A. Alfons, C. Croux, and S. Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat., 7:226–248, 2013.
 Loh and Wainwright [2013] PL Loh and M. J. Wainwright. Regularized mestimators with nonconvexity: Statistical and algorithmic theory for local optima. Arxiv preprint arXiv:1305.2436v2, 2013.
 Hsieh et al. [2011] C. J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Neur. Info. Proc. Sys. (NIPS), 24, 2011.
 Nesterov [2007] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic Univ. Louvain (UCL)., 2007.
 Nguyen and Tran [2013] N. H. Nguyen and T. D. Tran. Robust Lasso with missing and grossly corrupted observations. IEEE Trans. Info. Theory, 59(4):2036–2058, 2013.
 Negahban et al. [2012] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for highdimensional analysis of Mestimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
 Brem and Kruglyak [2005] Rachel B Brem and Leonid Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America, 102(5):1572–1577, 2005.
 Kanehisa et al. [2014] M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, and M. Tanabe. Data, information, knowledge and principle: back to metabolism in kegg. Nucleic Acids Res., 42:D199–D205, 2014.
 Candes et al. [2006] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, August 2006.

 Vershynin [2012] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
Appendix
Appendix A Useful lemma(s)
Lemma 2 (Lemma 1 of [13]).
Suppose that $x_1, \ldots, x_n$ are i.i.d. samples from $\mathcal{N}(0, \Sigma^*)$. Let $\mathcal{E}$ be the event that the maximum element-wise deviation of the sample covariance matrix from $\Sigma^*$ is suitably bounded, where the constant involved is any constant greater than 2. Then, the event $\mathcal{E}$ occurs with at least the stated probability.
Appendix B Proof of Theorem 1
Consider the parameter error of an arbitrary local optimum $(\widetilde{\Theta}, \widetilde{w})$ of estimator (3). Our proof mainly uses the fact that this point is a local minimum of (3), and hence satisfies a first-order stationarity inequality (see [22] for details) in $\Theta$ with $\widetilde{w}$ held fixed. Therefore, taking the comparison point to be $\Theta^*$, we have
(9) 
where $S$ denotes the true support set of $\Theta^*$, and the two inequalities hold by the convexity and the triangle inequality of the $\ell_1$ norm, respectively.
Lemma 3.
Proof.
Lemma 3 can be proved from the fact that the loss is a convex function. Hence, the function given by the partial minimization is also convex, and the stated relation holds (see [22] for details).
Now, suppose that the error exceeds the bound claimed in the statement. Then, we have

(10) 

In that case, we can rescale the error so that its norm sits exactly at the claimed radius. Hence, condition (C-4) holds by the assumption of the statement. Combining with (B) yields
(11) 
Now, from (B) and (11), together with condition (C-4) and Hölder's inequality, we obtain
By the choice of $\lambda_n$ in the assumption of the statement, and by the facts noted above, we can rearrange the above inequality into
which contradicts the assumption made at the beginning of this proof. Hence, by contradiction, we conclude that the claimed bound holds under the conditions of the statement. ∎
Now, by the RSC condition (C-3), we obtain
By condition (C-4),
(I):  
and at the same time, by (B),
Finally, by the assumption on $\lambda_n$ and Hölder's inequality,
Combining all the pieces together yields
(12) 
where the auxiliary norm appearing above is defined as the absolute sum of the diagonal elements of a matrix. Therefore, (B) implies
At the same time, in order to derive the second error bound, we again use the inequality in (B):
Hence,
which completes the proof.