Gaussian graphical models (GGMs) form a powerful class of statistical models for representing distributions over a set of variables . These models employ undirected graphs to encode conditional independence assumptions among the variables, which is particularly convenient for exploring network structures. GGMs are widely used in variety of domains, including computational biology 3], image processing [4, 5, 6], statistical physics , and spatial statistics . In many modern applications, the number of variables can exceed the number of observations For instance, the number of genes in microarray data is typically larger than the sample size. In such high-dimensional settings, sparsity constraints are particularly pertinent for estimating GGMs, as they encourage only a few parameters to be non-zero and induce graphs with few edges. The most widely used estimator among others (see e.g. ) minimizes the Gaussian negative log-likelihood regularized by the norm of the entries (or the off-diagonal entries) of the precision matrix (see [10, 11, 12]). This estimator enjoys strong statistical guarantees (see e.g. ). The corresponding optimization problem is a log-determinant program that can be solved with interior point methods  or by co-ordinate descent algorithms [11, 12]. Alternatively neighborhood selection [15, 16] can be employed to estimate conditional independence relationships separately for each node in the graph, via Lasso linear regression, . Under certain assumptions, the sparse GGM structure can still be recovered even under high-dimensional settings.
The aforementioned approaches rest on a fundamental assumption: the multivariate normality of the observations. However, outliers and corruption are frequently encountered in high-dimensional data (see e.g. for gene expression data). Contamination of a few observations can drastically affect the quality of model estimation. It is therefore imperative to devise procedures that can cope with observations deviating from the model assumption. Despite this fact, little attention has been paid to robust estimation of high-dimensional graphical models. Relevant work includes , which leverages multivariate -distributions for robustified inference and the EM algorithm. They also propose an alternative -model which adds flexibility to the classical but requires the use of Monte Carlo EM or variational approximation as the likelihood function is not available explicitly. Another pertinent work is that of  which introduces a robustified likelihood function. A two-stage procedure is proposed for model estimation, where the graphical structure is first obtained via coordinate gradient descent and the concentration matrix coefficients are subsequently re-estimated using iterative proportional fitting so as to guarantee positive definiteness of the final estimate.
In this paper, we propose the Trimmed Graphical Lasso method for robust Gaussian graphical modeling in the sparse high-dimensional setting. Our approach is inspired by the classical Least Trimmed Squares method used for robust linear regression , in the sense that it disregards the observations that are judged less reliable. More specifically the Trimmed Graphical Lasso seeks to minimize a weighted version of the negative log-likelihood regularized by the penalty on the concentration matrix for the GGM and under some simple constraints on the weights. These weights implicitly induce the trimming of certain observations. Our key contributions can be summarized as follows.
We introduce the Trimmed Graphical Lasso formulation, along with two strategies for solving the objective. One involves solving a series of graphical lasso problems; the other is more efficient and leverages composite gradient descent in conjunction with partial optimization.
Experimental results under various data corruption scenarios further demonstrate the value of our approach.
2 Problem Setup and Robust Gaussian Graphical Models
Notation. For matrices and , denotes the trace inner product . For a matrix and parameter , denotes the element-wise norm, and does the element-wise norm only for off-diagonal entries. For example, . Finally, we use and to denote the Frobenius and spectral norms, respectively.
Setup. Let be a zero-mean Gaussian random field parameterized by concentration matrix :
is the log-partition function of Gaussian random field. Here, the probability density function in (1) is associated with -variate Gaussian distribution, where .
Given i.i.d. samples from high-dimensional Gaussian random field (1), the standard way to estimate the inverse covariance matrix is to solve the regularized maximum likelihood estimator (MLE) that can be written as the following regularized log-determinant program:
where is the space of the symmetric positive definite matrices, and is a regularization parameter that encourages a sparse graph model structure.
In this paper, we consider the case where the number of random variablesmay be substantially larger than the number of sample size , however, the concentration parameter of the underlying distribution is sparse:
[leftmargin=0.1cm, itemindent=1.1cm,label=(C-), ref=(C-)]
The number of non-zero off-diagonal entries of is at most , that is .
Now, suppose that samples are drawn from this underlying distribution (1) with true parameter . We further allow some samples are corrupted and not drawn from (1). Specifically, the set of sample indices is separated into two disjoint subsets: if -th sample is in the set of “good” samples, which we name , then it is a genuine sample from (1) with the parameter . On the other hand, if the -th sample is in the set of “bad” samples, , the sample is corrupted. The identifications of and are hidden to us. However, we naturally assume that only a small number of samples are corrupted:
[leftmargin=0.1cm, itemindent=1.1cm,label=(C-), ref=(C-)]
Let be the number of good samples: and hence . Then, we assume that larger portion of samples are genuine and uncorrupted so that where . If we assume that 40% of samples are corrupted, then .
2.1 Trimmed Graphical Lasso
We now propose a Trimmed Graphical Lasso for robust estimation of sparse GGMs:
where is a regularization parameter to decide the sparsity of our estimation, and is another parameter, which decides the number of samples (or sum of weights) used in the training. is ideally set as the number of uncorrupted samples in , but practically we can tune the parameter by cross-validation. Here, the constraint is required to analyze this non-convex optimization problem as discussed in . For another tuning parameter , any positive real value would be sufficient for as long as . Finally note that when is fixed as (and is set as infinity), the optimization problem (3) will be simply reduced to the vanilla regularized MLE for sparse GGM without concerning outliers.
The optimization problem (3) is convex in as well as in , however this is not the case jointly. Nevertheless, we will show later that any local optimum of (3) is guaranteed to be strongly consistent under some fairly mild conditions.
As we briefly discussed above, the problem (3) is not jointly convex but biconvex. One possible approach to solve the objective of (3) thus is to alternate between solving for with fixed and solving for with fixed Given solving for is straightforward and boils down to assigning a weight of one to the observations with lowest negative log-likelihood and a weight of zero to the remaining ones. Given , solving for can be accomplished by any algorithm solving the “vanilla” graphical Lasso program, e.g. [11, 12]. Each step solves a convex problem hence the objective is guaranteed to decrease at each iteration and will converge to a local minima.
A more efficient optimization approach can be obtained by adopting a partial minimization strategy for : rather than solving to completion for each time is updated, one performs a single step update. This approach stems from considering the following equivalent reformulation of our objective:
On can then leverage standard first-order methods such as projected and composite gradient descent  that will converge to local optima. The overall procedure is depicted in Algorithm 1. Therein we assume that we pick sufficiently large, so one does not need to enforce the constraint explicitly. If needed the constraint can be enforced by an additional projection step .
3 Statistical Guarantees of Trimmed Graphical Lasso
One of the main contributions of this paper is to provide the statistical guarantees of our Trimmed Graphical Lasso estimator for GGMs. The optimization problem (3) is non-convex, and therefore the gradient-type methods solving (3) will find estimators by local minima. Hence, our theory in this section provides the statistical error bounds on any local minimum measured by and norms simultaneously.
Suppose that we have some local optimum of (3) by arbitrary gradient-based method. While is fixed unconditionally, we define as follows: for a sample index , is simply set to so that . Otherwise for a sample index , we set . Hence, is dependent on .
In order to derive the upper bound on the Frobenius norm error, we first need to assume the standard restricted strong convexity condition of (3) with respective to the parameter :
[leftmargin=0.1cm, itemindent=1.1cm,label=(C-), ref=(C-)]
(Restricted strong convexity condition) Let be an arbitrary error of parameter . That is, . Then, for any possible error such that ,
where is a curvature parameter.
Note that in order to guarantee the Frobenius norm-based error bounds, 1 is required even for the vanilla Gaussian graphical models without outliers, which has been well studied by several works such as the following lemma:
Lemma 1 (Section B.4 of ).
For any such that ,
thus 1 holds with .
While 1 is a standard condition that is also imposed for the conventional estimators under clean set of of samples, we additionally require the following condition for a successful estimation of (3) on corrupted samples:
[leftmargin=0.1cm, itemindent=1.1cm,label=(C-), ref=(C-)]
Consider arbitrary local optimum . Let and . Then,
with some positive quantities and on and . These will be specified below for some concrete examples.
1 can be understood as a structural incoherence condition between the model parameter and the weight parameter . Such a condition is usually imposed when analyzing estimators with multiple parameters (for example, see  for a robust linear regression estimator). Since is defined depending on , each local optimum has its own 1 condition. We will see in the sequel that under some reasonable cases, this condition for any local optimum holds with high probability. Also note that for the case with clean samples, the condition 1 is trivially satisfied since for all and hence the LHS becomes 0.
Armed with these conditions, we now state our main theorem on the error bounds of our estimator (3):
The statement in Theorem 1 holds deterministically, and the probabilistic statement comes where we show 1 and (6) for a given are satisfied. Note that, defining , it is a standard way of choosing based on (see , for details). Also it is important to note that the term captures the relation between element-wise norm and the error norm including diagonal entries. Due to the space limit, the proof of Theorem 1 (and all other proofs) are provided in the Supplements.
Now, it is natural to ask how easily we can satisfy the conditions in Theorem 1. Intuitively it is impossible to recover true parameter by weighting approach as in (3) when the amount of corruptions exceeds that of normal observation errors.
To this end, suppose that we have some upper bound on the corruptions:
[leftmargin=0.1cm, itemindent=1.35cm,label=(C-B1), ref=(C-B1)]
For some function , we have
where denotes the sub-design matrix in corresponding to outliers. Under this assumption, we can properly choose the regularization parameter satisfying (6) as follows:
If we further assume the number of corrupted samples scales with at most :
[leftmargin=0.1cm, itemindent=1.35cm,label=(C-B2), ref=(C-B2)]
for some constant ,
then we can derive the following result as another corollary of Theorem 1:
Consider corrupted Gaussian graphical models. Suppose that the conditions 1, 1 and 1 hold. Also suppose that the regularization parameter is set as where . Then, if the sample size is lower bounded as
with probability at least for some universal positive constants and .
Note that the norm based error bound also can be easily derived using the selection of from (1). Corollary 2 reveals an interesting result: even when samples out of total samples are corrupted, our estimator (3) can successfully recover the true parameter with guaranteed error in (8). The first term in this bound is which exactly recovers the Frobenius error bound for the case without outliers (see [13, 22] for example). Due to the outliers, we have the performance degrade with the second term, which is . To the best of our knowledge, this is the first statistical error bounds on the parameter estimation for Gaussian graphical models with outliers. Also note that Corollary 1 only concerns on any local optimal point derived by an arbitrary optimization algorithm. For the guarantees of multiple local optima simultaneously, we may use a union bound from the corollary.
When Outliers Follow a Gaussian Graphical Model
Now let us provide a concrete example and show how in 1 is precisely specified in this case:
[leftmargin=0.1cm, itemindent=1.35cm,label=(C-B3), ref=(C-B3)]
Outliers in the set are drawn from another Gaussian graphical model (1) with a parameter .
This can be understood as the Gaussian mixture model where the most of the samples are drawn fromthat we want to estimate, and small portion of samples are drawn from . In this case, Corollary 2 can be further shaped as follows:
In this section we corroborate the performance of our Trimmed Graphical Lasso (trim-glasso) algorithm on simulated data. We compare against glasso: the vanilla Graphical Lasso ; the t-Lasso and t*-lasso methods , and robust-LL: the robustified-likelihood approach of .
4.1 Simulated data
Our simulation setup is similar to  and is a akin to gene regulatory networks. Namely we consider four different scenarios where the outliers are generated from models with different graphical structures. Specifically, each sample is generated from the following mixture distribution:
where and . Four different outlier distributions are considered:
M1: , M2: ,
M3: , M4: .
We also consider the scenario where the outliers are not symmetric about the mean and simulate data from the following model
For each simulation run, is a randomly generated precision matrix corresponding to a network with hub nodes simulated as follows. Let be the adjacency of the network. For all we set with probability 0.03, and zero otherwise. We set We then randomly select 9 hub nodes and set the elements of the corresponding rows and columns of to one with probability 0.4 and zero otherwise. Using , the simulated nonzero coefficients of the precision matrix are sampled as follows. First we create a matrix so that if , and is sampled uniformly from if Then we set Finally we set where
is the smallest eigenvalue ofis a randomly generated precision matrix in the same way is generated.
For the robustness parameter of the robust-LL method, we consider as recommended in . For the trim-glasso method we consider Since all the robust comparison methods converge to a stationary point, we tested various initialization strategies for the concentration matrix, including , and the estimate from glasso. We did not observe any noticeable impact on the results.
Figure 1 presents the average ROC curves of the comparison methods over 100 simulation data sets for scenarios M1-M4 as the tuning parameter varies. In the figure, for robust-LL and trim-glasso methods, we depict the best curves with respect to parameter and respectively. Due to space constraints, the detailed results for all the values of and considered, as well as the results for model M5 are provided in the Supplements.
From the ROC curves we can see that our proposed approach is competitive compared the alternative robust approaches t-lasso, t*-lasso and robust-LL. The edge over glasso is even more pronounced for scenarios M2, M4 and M5. Surprisingly, trim-glasso with achieves superior sensitivity for nearly any specificity.
Computationally the trim-glasso method is also competitive compared to alternatives. The average run-time over the path of tuning parameters is 45.78s for t-lasso, 22.14s for t*-lasso, 11.06s for robust-LL, 1.58s for trimmed lasso, 1.04s for glasso. Experiments were run on R in a single computing node with a Intel Core i5 2.5GHz CPU and 8G memory. For t-lasso, t*-lasso and robust-LL we used the R implementations provided by the methods’ authors. For glasso we used the glassopath package.
4.2 Application to the analysis of Yeast Gene Expression Data
We analyze a yeast microarray dataset generated by . The dataset concerns yeast segregants (instances). We focused on genes (variables) belonging to cell-cycle pathway as provided by the KEGG database 
. For each of these genes we standardize the gene expression data to zero-mean and unit standard deviation. We observed that the expression levels of some genes are clearly not symmetric about their means and might include outliers. For example the histogram of geneORC3 is presented in Figure 2(a).
For the robust-LL method we set and for trim-glasso we use We use 5-fold-CV to choose the tuning parameters for each method. After is chosen for each method, we rerun the methods using the full dataset to obtain the final precision matrix estimates.
Figure 2(b) shows the cell-cycle pathway estimated by our proposed method. For comparison the cell-cycle pathway from the KEGG  is provided in the Supplements. It is important to note that the KEGG graph corresponds to what is currently known about the pathway. It should not be treated as the ground truth. Certain discrepancies between KEGG and estimated graphs may also be caused by inherent limitations in the dataset used for modeling. For instance, some edges in cell-cycle pathway may not be observable from gene expression data. Additionally, the perturbation of cellular systems might not be strong enough to enable accurate inference of some of the links.
glasso tends to estimate more links than the robust methods. We postulate that the lack of robustness might result in inaccurate network reconstruction and the identification of spurious links. Robust methods tend to estimate networks that are more consistent with that from the KEGG (-score of 0.23 for glasso, 0.37 for t*-lasso, 0.39 for robust-NLL and 0.41 for trim-glasso, where the
- Lauritzen  S.L. Lauritzen. Graphical models. Oxford University Press, USA, 1996.
- Oh and Deasy  Jung Hun Oh and Joseph O. Deasy. Inference of radio-responsive gene regulatory networks using the graphical lasso algorithm. BMC Bioinformatics, 15(S-7):S5, 2014.
- Manning and Schutze  C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
- Woods  J.W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846–850, October 1978.
- Hassner and Sklansky  M. Hassner and J. Sklansky. Markov random field models of digitized image texture. In ICPR78, pages 538–540, 1978.
- Cross and Jain  G. Cross and A. Jain. Markov random field texture models. IEEE Trans. PAMI, 5:25–39, 1983.
- Ising  E. Ising. Beitrag zur theorie der ferromagnetismus. Zeitschrift für Physik, 31:253–258, 1925.
- Ripley  B. D. Ripley. Spatial statistics. Wiley, New York, 1981.
- Yang et al.  E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for graphical models. In Neur. Info. Proc. Sys. (NIPS), 27, 2014.
- Yuan and Lin  M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
- Friedman et al.  J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.
- Bannerjee et al.  O. Bannerjee, , L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Jour. Mach. Lear. Res., 9:485–516, March 2008.
- Ravikumar et al.  P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing -penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
- Boyd and Vandenberghe  S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, UK, 2004.
- Meinshausen and Bühlmann  N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
- Yang et al.  E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models. In Neur. Info. Proc. Sys. (NIPS), 25, 2012.
- Tibshirani  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
Daye et al. 
Z.J. Daye, J. Chen, and Li H.
High-dimensional heteroscedastic regression with an application to eqtl data analysis.Biometrics, 68:316–326, 2012.
- Finegold and Drton  Michael Finegold and Mathias Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, 5(2A):1057–1080, 2011.
- Sun and Li  H. Sun and H. Li. Robust Gaussian graphical modeling via l1 penalization. Biometrics, 68:1197–206, 2012.
- Alfons et al.  A. Alfons, C. Croux, and S. Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat., 7:226–248, 2013.
- Loh and Wainwright  P-L Loh and M. J. Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Arxiv preprint arXiv:1305.2436v2, 2013.
- Hsieh et al.  C. J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Neur. Info. Proc. Sys. (NIPS), 24, 2011.
- Nesterov  Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic Univ. Louvain (UCL)., 2007.
- Nguyen and Tran  N. H. Nguyen and T. D. Tran. Robust Lasso with missing and grossly corrupted observations. IEEE Trans. Info. Theory, 59(4):2036–2058, 2013.
- Negahban et al.  S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
- Brem and Kruglyak  Rachel B Brem and Leonid Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America, 102(5):1572–1577, 2005.
- Kanehisa et al.  M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, and M. Tanabe. Data, information, knowledge and principle: back to metabolism in kegg. Nucleic Acids Res., 42:D199–D205, 2014.
- Candes et al.  E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, August 2006.
Introduction to the non-asymptotic analysis of random matrices.In Y. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012. forthcoming.
Appendix A Useful lemma(s)
Lemma 2 (Lemma 1 of ).
Suppose that are iid samples from with . Let be the event that
where is any constant greater than 2. Then, the probability of event occurring is at least .
Appendix B Proof of Theorem 1
This inequality comes from the first order stationary condition (see  for details) in terms of only fixing at . Therefore, if we take above, we have
where is true support set of , the inequalities and hold by respectively the convexity and the triangular inequality of norm.
Now, suppose that . Then, we have
Since , we can set so that . Hence, for , the condition 1 holds by assumption of the statement:
Combining with (B) yields
By the choice of in the assumption of the statement and by the fact that and , we can rearrange the above inequality into
which conflicts with the assumption in the beginning of this proof. Hence, by contradiction, we can conclude under conditions in the statement. ∎
Now, by the RSC condition in 1 for , we obtain
By the condition 1
and at the same time, by (B),
Finally, by the assumption on and Hölder’s inequity,
Combining all pieces together yields
where is defined as the absolute sum of diagonal elements of a matrix, and . Therefore, (B) implies
At the same time in order to derive error bound, we again use the inequality in (B):
which completes the proof.