1 Introduction
Graphical Gaussian models have attracted a lot of recent interest. In these models an observed random vector X = (X_1, ..., X_p) is assumed to follow a multivariate normal distribution N(μ, Σ), where μ is the mean vector and Σ the positive definite covariance matrix. Each model is associated with an undirected graph G = (V, E) with vertex set V = {1, ..., p}, and defined by requiring that for each nonedge {i, j}, the variables X_i and X_j are conditionally independent given all the remaining variables X_C. Here, C denotes the complement V \ {i, j}. Such pairwise conditional independence holds if and only if (Σ^{-1})_{ij} = 0; see Lauritzen (1996) for this fact and general background on graphical models. Therefore, inferring the graph corresponds to inferring the nonzero elements of Σ^{-1}.

Classical solutions to the model selection problem include constraint-based approaches that test the model-defining conditional independence constraints, and score-based searches that optimize a model score over a set of graphs. A review of this work can be found in Drton and Perlman (2007). Recently, however, penalized likelihood approaches based on the one-norm of the concentration matrix have become increasingly popular. Meinshausen and Bühlmann (2006) proposed a method that uses lasso regressions of each variable X_i on the remaining variables. In subsequent work, Yuan and Lin (2007) and Banerjee, El Ghaoui and d'Aspremont (2008) discuss the computation of the exact solution to the convex optimization problem arising from the likelihood penalization. Finally, Friedman, Hastie and Tibshirani (2008) developed the graphical lasso (glasso), which is a computationally efficient algorithm that maximizes the penalized log-likelihood function through coordinate descent. The theory that accompanies these algorithmic developments supplies high-dimensional consistency properties under assumptions of graph sparsity; see, for example, Ravikumar et al. (2009).

Inference of a graph can be significantly impacted, however, by deviations from normality. In particular, contamination of a handful of variables in a few experiments can lead to a drastically wrong graph. Applied work thus often proceeds by identifying and removing such experiments before data analysis, but such outlier screening can become difficult with large data sets. More importantly, removing entire experiments as outliers may discard useful information from the uncontaminated variables they may contain.
The existing literature on robust inference in graphical models is fairly limited. One line of work concerns constraint-based approaches and adopts robustified statistical tests [Kalisch and Bühlmann (2008)]. An approach for fitting the model associated with a given graph using a robustified likelihood function is described in Miyamura and Kano (2006). In some cases simple transformations of the data may be effective at minimizing the effect of outliers or of contamination on a small scale. A normal quantile transformation, in particular, appears to be effective in many cases.
In this paper we extend the scope of robust inference by providing a tool for robust model selection that can be applied with highly multivariate data. We build upon the glasso of Friedman, Hastie and Tibshirani (2008), but model the data using multivariate t-distributions. Using the EM algorithm, the tlasso methods we propose are only slightly less computationally efficient than the glasso but cope rather well with contaminated data.

The paper is organized as follows. In Section 2 we review maximization of the penalized Gaussian log-likelihood function using the glasso. In Section 3 we introduce the classical multivariate t-distribution and describe maximization of the (unpenalized) log-likelihood using the EM algorithm. In Section 4 we combine the two techniques into the tlasso to maximize the penalized log-likelihood in the multivariate t case. In Section 5 we introduce an alternative multivariate t-distribution and describe how inference can be done using stochastic and variational EM. In Section 6 we compare the glasso to our t-based methods on simulated data. Finally, in Section 7 we analyze two different gene expression data sets using the competing methods. Our findings are summarized in Section 8.
2 Graphical Gaussian models and the graphical lasso
Suppose we observe a sample of independent random vectors X_1, ..., X_n that are distributed according to the multivariate normal distribution N(μ, Σ). Likelihood inference about the covariance matrix Σ is based on the log-likelihood function

ℓ(Σ) = −(n/2) log det Σ − (n/2) tr(Σ^{-1} S),

where the empirical covariance matrix

S = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T

is defined based on deviations from the sample mean X̄. Let K = Σ^{-1} denote the (p × p) concentration matrix. In penalized likelihood methods a one-norm penalty is added to the log-likelihood function, which effectively performs model selection because the resulting estimates of K may have entries that are exactly zero. Omitting irrelevant factors and constants, we are led to the problem of maximizing the function

(1)   log det K − tr(SK) − λ ||K||_1

over the cone of positive definite matrices, where ||K||_1 is the sum of the absolute values of the entries of K. The multiplier λ is a positive tuning parameter. Larger values of λ lead to more entries of K being estimated as zero. Cross-validation or information criteria can be used to tune λ.
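As a concrete illustration of (1), the objective can be evaluated directly; the following sketch is ours (implementations differ on whether the diagonal of K is penalized; here all entries are):

```python
import numpy as np

def glasso_objective(K, S, lam):
    """Evaluate the penalized log-likelihood (1): log det K - tr(SK) - lam*||K||_1.

    K   : candidate concentration matrix (must be positive definite)
    S   : empirical covariance matrix
    lam : positive one-norm tuning parameter
    """
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        return float('-inf')  # K lies outside the positive definite cone
    return logdet - np.trace(S @ K) - lam * np.abs(K).sum()
```

With S = K = I_2 and λ = 0.5, for example, the value is 0 − 2 − 0.5·2 = −3; increasing λ can only decrease the objective for a fixed K.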
The glasso is an iterative method for solving the convex optimization problem with the objective function in (1). Its updates operate on the covariance matrix W = K^{-1}. In each step one row (and column) of the symmetric matrix W is updated based on a partial maximization of (1) in which all but the considered row (and column) of W are held fixed. This partial maximization is solved via coordinate descent, as briefly reviewed next.
Partition off the last row and column of W and S as

W = [ W_11   w_12 ]      S = [ S_11   s_12 ]
    [ w_12^T  w_22 ],        [ s_12^T  s_22 ].

Then, as shown in Banerjee, El Ghaoui and d'Aspremont (2008), partially maximizing (1) with W_11 held fixed yields w_12 = W_11 β̂, where β̂ minimizes

(1/2) || W_11^{1/2} β − W_11^{-1/2} s_12 ||² + λ ||β||_1

with respect to β. The glasso finds β̂ by coordinate descent in each of the coordinates j = 1, ..., p − 1, using the updates

β̂_j ← S( s_{12,j} − Σ_{k≠j} (W_11)_{kj} β̂_k , λ ) / (W_11)_{jj},

where S(x, t) = sign(x)(|x| − t)_+ is the soft-threshold operator. The algorithm then cycles through the rows and columns of W and S until convergence. The diagonal elements are simply w_jj = s_jj + λ. See Friedman, Hastie and Tibshirani (2008) for more details on the method.
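The inner problem above is a standard lasso, and the coordinate updates can be sketched generically as follows (the names A and b are ours; in the glasso, A plays the role of the fixed block W_11 and b that of the covariance column s_12):

```python
import numpy as np

def soft_threshold(x, t):
    """S(x, t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_coordinate_descent(A, b, lam, n_iter=100):
    """Coordinate descent for min_beta 0.5*beta'A beta - b'beta + lam*||beta||_1.

    This is the generic form of the glasso inner problem.
    """
    p = len(b)
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: b_j minus the contribution of the other coordinates
            r = b[j] - A[j] @ beta + A[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / A[j, j]
    return beta
```

For A = I the update decouples, so, for instance, b = (1, 0.2) with λ = 0.5 yields β̂ = (0.5, 0): the second coefficient is thresholded to exactly zero, which is the mechanism behind the sparsity of the glasso estimate.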
3 Graphical models based on the t-distribution
3.1 Classical multivariate t-distribution
The classical multivariate t-distribution t_{p,ν}(μ, Ψ) on R^p has Lebesgue density

(2)   f(x) = Γ((ν+p)/2) / [ Γ(ν/2) (νπ)^{p/2} (det Ψ)^{1/2} ] · [ 1 + (x − μ)^T Ψ^{-1} (x − μ)/ν ]^{−(ν+p)/2},

with ν > 0 degrees of freedom and x ∈ R^p. The vector μ and the positive definite matrix Ψ determine the first two moments of the distribution. If X ~ t_{p,ν}(μ, Ψ) with ν > 2 degrees of freedom, then the expectation is E[X] = μ and the covariance matrix is Var(X) = ν/(ν−2) Ψ. From here on we will always assume ν > 2 for the covariance matrix to exist. For notational convenience and to illustrate the parallels with the Gaussian model, we define Σ = ν/(ν−2) Ψ.

If Y ~ N(0, Ψ) is a multivariate normal random vector independent of the Gamma random variable τ ~ Γ(ν/2, ν/2), then X = μ + Y/√τ is distributed according to t_{p,ν}(μ, Ψ); see Kotz and Nadarajah (2004), Chapter 1. This scale-mixture representation, illustrated in Figure 1, allows for easy sampling. It also clarifies how the use of t-distributions leads to more robust inference because extreme observations can arise from small values of τ. An additional useful fact is that the conditional distribution of τ given X = x is again a Gamma distribution, namely,

(3)   τ | X = x ~ Γ( (ν + p)/2 , [ν + (x − μ)^T Ψ^{-1} (x − μ)]/2 ).
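The scale-mixture representation suggests the following sampler, a minimal sketch in Python/NumPy (function name and interface are ours): draw τ ~ Γ(ν/2, ν/2), Z ~ N(0, Ψ), and return μ + Z/√τ.

```python
import numpy as np

def sample_classical_t(n, mu, Psi, nu, rng=None):
    """Draw n samples from the classical multivariate t via the scale mixture:
    tau ~ Gamma(nu/2, rate=nu/2), X = mu + Z / sqrt(tau) with Z ~ N(0, Psi)."""
    rng = np.random.default_rng(rng)
    p = len(mu)
    # NumPy's gamma uses a scale parameter, so rate nu/2 means scale 2/nu
    tau = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    Z = rng.multivariate_normal(np.zeros(p), Psi, size=n)
    return mu + Z / np.sqrt(tau)[:, None]
```

Note that the single divisor τ is shared by all p coordinates of an observation, so an extreme draw (small τ) inflates the whole observation at once; this is exactly the behavior the alternative model of Section 5 relaxes.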
Let G = (V, E) be a graph with vertex set V = {1, ..., p}. We define the associated graphical model for the t-distribution by requiring that (Ψ^{-1})_{ij} = 0 for indices i ≠ j corresponding to a nonedge {i, j}. This mimics the Gaussian model in that zero constraints are imposed on the inverse of the covariance matrix. However, in a t-distribution this no longer corresponds to conditional independence, and the density does not factor according to the graph. The conditional dependence manifests itself, in particular, in conditional variances in that, even if (Ψ^{-1})_{ij} = 0, the conditional variance of X_i given the remaining variables still depends on x_j. For a simple illustration, let Ψ be a diagonal matrix. Then

Var(X_1 | X_2 = x_2, ..., X_p = x_p) = ψ_11 · [ν + Σ_{j=2}^p (x_j − μ_j)²/ψ_jj] / (ν + p − 3),

which can be shown by taking iterated conditional expectations, and using that Var(X_1 | τ, X_2, ..., X_p) = ψ_11/τ and that τ given (X_2, ..., X_p) has a Gamma distribution; recall (3). Clearly, this conditional variance depends on all x_j, j = 2, ..., p.
Despite the lack of conditional independence, the following property still holds (proved in the Appendix).
Proposition 1
Let Y ~ N(0, Ψ), where (Ψ^{-1})_{ij} = 0 for pairs of indices that correspond to nonedges in the graph G. Let τ be independent of Y and follow any distribution on the positive real numbers with E[1/τ] < ∞, and define X = μ + Y/√τ. If two nodes i and j are separated by a set of nodes C in G, then X_i and X_j are conditionally uncorrelated given X_C.

The edges in the graph indicate the allowed conditional independencies in the latent Gaussian vector Y. According to Proposition 1, however, we may also interpret the graph in terms of the observed variables X. The zero conditional correlations entail that mean-square error optimal prediction of variable X_i can be based on the variables that correspond to neighbors of node i in the graph, which is a very appealing property.
3.2 EM algorithm for estimation
The lack of density factorization properties complicates likelihood inference with t-distributions. However, the EM algorithm provides a way to circumvent this issue. Equipped with the normal/Gamma scale-mixture construction, we treat τ as a hidden variable and use that the conditional distribution of X given τ is N(μ, Ψ/τ). We now outline the EM algorithm for the t-distribution assuming the degrees of freedom ν to be known. If desired, ν could also be estimated in a line search that is best based on the actual likelihood [Liu and Rubin (1995)].

Consider a sample X_1, ..., X_n drawn from t_{p,ν}(μ, Ψ). Let τ_1, ..., τ_n be an associated sequence of hidden Gamma random variables. Observation of the τ_i would lead to the following complete-data log-likelihood function for μ and Ψ:

ℓ(μ, Ψ) ≃ −(n/2) log det Ψ − (1/2) Σ_{i=1}^n τ_i (X_i − μ)^T Ψ^{-1} (X_i − μ),

where, with some abuse, the symbol ≃ indicates that irrelevant additive constants are omitted. The complete-data sufficient statistics

Σ_i τ_i,   Σ_i τ_i X_i,   Σ_i τ_i X_i X_i^T

are thus linear in the τ_i. We obtain the following EM algorithm for computing the maximum likelihood estimates of μ and Ψ:

E-step:

The E-step is simple because

(5)   E[τ | X = x] = (ν + p) / [ν + (x − μ)^T Ψ^{-1} (x − μ)].

Given current estimates μ̂^(t) and Ψ̂^(t), we compute in the (t+1)st iteration

τ̂_i^(t+1) = (ν + p) / [ν + (X_i − μ̂^(t))^T (Ψ̂^(t))^{-1} (X_i − μ̂^(t))].

M-step:

Calculate the updated estimates

(6)   μ̂^(t+1) = Σ_i τ̂_i^(t+1) X_i / Σ_i τ̂_i^(t+1),

(7)   Ψ̂^(t+1) = (1/n) Σ_i τ̂_i^(t+1) (X_i − μ̂^(t+1))(X_i − μ̂^(t+1))^T.
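The E-step (5) and M-steps (6) and (7) can be sketched as follows; this is a bare-bones illustration assuming ν known, not the authors' implementation:

```python
import numpy as np

def t_em(X, nu, n_iter=50):
    """EM for the classical multivariate t: E-step (5) and M-steps (6)-(7)."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Psi = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        # E-step: tau_hat_i = (nu + p) / (nu + Mahalanobis distance of x_i)
        D = X - mu
        delta = np.einsum('ij,jk,ik->i', D, np.linalg.inv(Psi), D)
        tau = (nu + p) / (nu + delta)
        # M-step: weighted mean (6) and weighted scatter matrix (7)
        mu = (tau[:, None] * X).sum(axis=0) / tau.sum()
        D = X - mu
        Psi = (tau[:, None] * D).T @ D / n
    return mu, Psi, tau
```

Observations far from the current center receive small weights τ̂_i and contribute little to the updated mean and scatter matrix, which is the source of the robustness.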
4 Penalized inference in t-distribution models
Model selection in graphical models can be performed, in principle, by any of the classical constraint- and score-based methods. In score-based searches through the set of all undirected graphs on p nodes, however, each model would have to be refit using an iterative method such as the EM algorithm from Section 3.2. The penalized likelihood approach avoids this problem.
Like in the Gaussian case, we put a one-norm penalty on the elements of Ψ^{-1} and wish to maximize the penalized log-likelihood function

(8)   Σ_{i=1}^n log f(X_i; μ, Ψ) − λ ||Ψ^{-1}||_1,

where f is the t-density from (2). To achieve this, we will use a modified version of the EM algorithm taking into account the one-norm penalty.

We treat τ_1, ..., τ_n as missing data. In the E-step of our algorithm, we calculate the conditional expectation of the penalized complete-data log-likelihood

(9)   ℓ_pen(μ, Ψ) = ℓ(μ, Ψ) − λ ||Ψ^{-1}||_1,

with ℓ(μ, Ψ) the complete-data log-likelihood from Section 3.2. Since ℓ_pen is again linear in the τ_i, the E-step takes the same form as in Section 3.2. Let μ̂^(t) and Ψ̂^(t) be the estimates after the tth iteration, and τ̂_i^(t+1) the conditional expectation of τ_i calculated in the (t+1)st E-step. Then in the M-step of our algorithm we wish to maximize the conditional expectation of (9) with respect to μ and Ψ. Differentiation with respect to μ yields μ̂^(t+1) from (6) for any value of Ψ. Therefore, Ψ̂^(t+1) is found by maximizing, over Θ = Ψ^{-1},

(10)   log det Θ − tr( S̃^(t+1) Θ ) − (2λ/n) ||Θ||_1,

where S̃^(t+1) = (1/n) Σ_i τ̂_i^(t+1) (X_i − μ̂^(t+1))(X_i − μ̂^(t+1))^T is a weighted empirical covariance matrix. The quantity in (10), however, is exactly the objective function maximized by the glasso.
Iterating the E- and M-steps just described, we obtain what we call the tlasso algorithm. Since the one-norm penalty forces some elements of Ψ^{-1} exactly to zero, the tlasso performs model selection and parameter estimation in a way that is similar to structural EM algorithms [Friedman (1997)]. Convergence to a stationary point is guaranteed in the penalized version of the EM algorithm [McLachlan and Krishnan (1997), Chapter 1.6]; typically a local maximum is found. Note also that the maximized log-likelihood function is not concave, and so one finds oneself in the usual situation of not being able to give any guarantees about having obtained a global maximum.
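One EM iteration of the tlasso thus reduces to computing the weights (5), updating the mean as in (6), and handing a weighted covariance matrix to a glasso solver for the M-step (10). A sketch of that E-step follows (the final glasso call, e.g. sklearn.covariance.graphical_lasso, is indicated only in a comment, since any implementation can be plugged in):

```python
import numpy as np

def tlasso_estep_scatter(X, mu, Theta, nu):
    """One E-step of the tlasso under current (mu, Theta = Psi^{-1}):
    compute the weights (5) and the weighted scatter matrix for the M-step."""
    n, p = X.shape
    D = X - mu
    delta = np.einsum('ij,jk,ik->i', D, Theta, D)  # Mahalanobis distances
    tau = (nu + p) / (nu + delta)                  # E-step weights, eq. (5)
    mu_new = (tau[:, None] * X).sum(axis=0) / tau.sum()   # eq. (6)
    D = X - mu_new
    S_tilde = (tau[:, None] * D).T @ D / n         # weighted covariance for (10)
    # M-step: maximize log det(Theta) - tr(S_tilde Theta) - penalty*||Theta||_1,
    # i.e. feed S_tilde into any graphical lasso solver, for example
    # sklearn.covariance.graphical_lasso(S_tilde, alpha=lam).
    return mu_new, S_tilde, tau
```

Setting all weights τ_i to one recovers the ordinary sample covariance, which is the sense (made precise in Section 6.3) in which the glasso is the tlasso with unit weights.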
5 Alternative t-model
5.1 Specification of the alternative model
The tlasso from Section 4 performs particularly well when a small fraction of the observations is contaminated (or otherwise extreme). In this case, these observations are downweighted in their entirety, and the gain from reducing the effect of contaminated nodes outweighs the loss from throwing away good data from other nodes. In high-dimensional data sets, however, the contamination, or other deviation from normality, may be in small parts of many observations. Downweighting entire observations may then no longer achieve the desired results. We will demonstrate this later in simulations (see the bottom panel of Figure 4).

To handle the above situation better, we consider an alternative extension of the univariate t-distribution, illustrated in Figure 2. Instead of one divisor τ per p-variate observation, we draw p divisors τ_1, ..., τ_p. For j = 1, ..., p, let τ_j ~ Γ(ν/2, ν/2) be independent of each other and of Y ~ N(0, Ψ). We then say that the random vector X with coordinates X_j = μ_j + Y_j/√τ_j follows an alternative multivariate t-distribution; in symbols, X ~ t*_{p,ν}(μ, Ψ).
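Sampling from the alternative model changes only the divisor step relative to the classical case: one independent Gamma divisor per coordinate. A sketch (function name is ours):

```python
import numpy as np

def sample_alternative_t(n, mu, Psi, nu, rng=None):
    """Draw n samples from the alternative multivariate t: one independent
    divisor tau_j ~ Gamma(nu/2, rate=nu/2) per coordinate,
    X_j = mu_j + Z_j / sqrt(tau_j) with Z ~ N(0, Psi)."""
    rng = np.random.default_rng(rng)
    p = len(mu)
    tau = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=(n, p))  # p divisors
    Z = rng.multivariate_normal(np.zeros(p), Psi, size=n)
    return mu + Z / np.sqrt(tau)
```

Because the divisors are drawn coordinate-wise, a single extreme divisor inflates only one entry of the observation, matching the "contamination scattered throughout the data" scenario.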
Unlike for the classical multivariate t-distribution, the covariance matrix of X ~ t*_{p,ν}(μ, Ψ) is no longer a constant multiple of Ψ when ν > 2. Clearly, the coordinate variances are still the same, namely,

Var(X_j) = ν/(ν−2) ψ_jj,

but the covariance between X_i and X_j with i ≠ j is now

Cov(X_i, X_j) = (E[1/√τ])² ψ_ij = (ν/2) [Γ((ν−1)/2)/Γ(ν/2)]² ψ_ij.

The same matrix Ψ thus implies smaller correlations (by the same constant multiple) in the t*-distribution. This reduced dependence is not surprising in light of the fact that now different and independent divisors appear in the different coordinates. Despite the decrease in marginal correlations, the result of Proposition 1 does not hold for conditional correlations in the alternative model. That is, (Ψ^{-1})_{ij} = 0 does not imply that X_i and X_j are conditionally uncorrelated given the remaining variables. Interpretation of the graph in the alternative model is thus limited to considering edges to represent the allowed conditional dependencies in the latent multivariate normal distribution.

The following simulation confirms the result and illustrates the effect. We consider a trivariate t*-distribution with (Ψ^{-1})_{12} = 0 and draw independent samples until we have 500,000 observations with the third coordinate close to each of a range of fixed values. The sample correlations of the first two coordinates given the varying values of the third are shown in Figure 3.
5.2 Alternative tlasso
Inference in the alternative t-model presents some difficulties because the likelihood function is not available explicitly. The complete-data likelihood function, however, is simply the product of the evaluations of Gamma densities (the divisor τ being a vector now) and a multivariate normal density. We can thus implement an EM-type procedure if we are able to compute the conditional expectation of the hidden variables given the observations. This time we treat the random variables τ_i1, ..., τ_ip as hidden for each observation i. Unfortunately, the conditional expectation is intractable. It can be estimated, however, using Markov chain Monte Carlo.

The complete-data log-likelihood function is equal to

(11)   (n/2) log det Θ − (1/2) Σ_{i=1}^n tr( Θ D_{τ_i} (X_i − μ)(X_i − μ)^T D_{τ_i} ) + terms not involving μ and Θ,

where Θ = Ψ^{-1} and D_{τ_i} is the diagonal matrix with √τ_i1, ..., √τ_ip along the diagonal. The trace in (11) is linear in the entries of the matrix D_{τ_i}(X_i − μ)(X_i − μ)^T D_{τ_i}. A Gibbs sampler for estimating the conditional expectation of this matrix given X_i cycles through the coordinates indexed by j = 1, ..., p and draws, in its rth iteration, a number τ_ij^(r) from the conditional distribution of τ_ij given X_i and the current values of the other divisors. This full conditional has density

(12)   h(τ) = (1/z) τ^{(ν−1)/2} exp(−aτ − b√τ),   τ > 0,

with

(13)   a = [ν + Θ_jj (x_ij − μ_j)²]/2,   b = (x_ij − μ_j) Σ_{k≠j} Θ_jk √τ_ik (x_ik − μ_k),

and normalizing constant z. This constant can always be expressed using hypergeometric functions, but, as we detail below, much simpler formulas can be obtained for the small integer degrees of freedom that are of interest in practice. The simpler formulas are obtained by partial integration.
From a and b in (13), form the ratio c = b/√a. In order to sample from the distribution in (12), we may draw s from

(14)   h_c(s) ∝ s^ν exp(−s² − cs),   s > 0,

and divide the square of the result by a, that is, set τ = s²/a. For our default of ν = 3, we thus need to sample from

(15)   h_c(s) ∝ s³ exp(−s² − cs),   s > 0.

Writing Φ for the cumulative distribution function of the standard normal distribution, the normalizing constant becomes

(16)   z(c) = (c² + 4)/8 − (√π/8) c (c² + 6) e^{c²/4} [1 − Φ(c/√2)].

For c = 0, the density h_0 is the density of the square root of a Γ(2, 1) random variable. For moderate c, we are thus led to the following rejection sampling procedure to draw from h_c.

Let g_θ denote an instrumental density, indexed by a parameter θ, from which we can sample directly. Rejection sampling using the family of densities g_θ proceeds by drawing a proposal s from g_θ and a uniform random variable U, and accepting s if U ≤ h_c(s)/[M g_θ(s)], repeating the process until acceptance. Here, M is a suitable multiplier such that h_c(s) ≤ M g_θ(s) for all s > 0.

An important ingredient to the rejection sampler is the parameter θ, which we choose as follows. For positive c, the instrumental density g_θ has a heavier tail than h_c provided θ is small enough, and for a given c the smallest admissible multiplier M is available in closed form. Varying θ, this multiplier is minimized at an explicitly computable value. If c is negative, then a fixed choice of θ yields a heavy enough tail and a bounded multiplier.
The rejection sampler performs draws from the exact conditional distribution (12). We find it works very well for data with not too extreme contamination such as, for instance, the original as well as bootstrap data from the application discussed in Section 7.2. When applied to data with very extreme observations, however, one is faced with large positive values of c. In this case the instrumental densities g_θ provide a poor approximation to the target density h_c, and the acceptance probabilities in the rejection sampling step become impractically low.

For large positive c, we thus use an alternative rejection procedure based on a change of variables. After the transformation, any member of a suitable instrumental family has a heavier tail than the target distribution. While it is not possible to find an analytical solution for the optimal instrumental parameters, fixed choices yield acceptance probabilities that are bounded away from zero for most plausible values of c. Since this alternative procedure will only be needed occasionally, these acceptance probabilities are adequate. Using this hybrid approach yields high overall acceptance probabilities for the data with extreme contamination described in Section 7.1.
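To make the acceptance step concrete, the following sketch targets h_c(s) ∝ s·exp(−s² − cs), the lowest-order member of the family (the degrees of freedom used in practice are larger, but the mechanics are identical). Its c = 0 member is a Rayleigh density that can be sampled by inverse transform, and for c ≥ 0 the ratio of target to instrumental density is proportional to exp(−cs) ≤ 1, so the acceptance test is particularly simple:

```python
import numpy as np

def sample_hc(c, n, rng=None):
    """Rejection sampler for h_c(s) proportional to s*exp(-s**2 - c*s), c >= 0.

    Instrumental density: h_0(s) proportional to s*exp(-s**2), a Rayleigh
    density sampled exactly by inverse transform: s = sqrt(-log(U)).
    Acceptance probability for a proposal s is exp(-c*s), since
    h_c(s)/h_0(s) is proportional to exp(-c*s) <= 1.
    """
    rng = np.random.default_rng(rng)
    out = np.empty(n)
    k = 0
    while k < n:
        s = np.sqrt(-np.log(rng.random()))   # Rayleigh proposal from h_0
        if rng.random() <= np.exp(-c * s):   # accept with prob exp(-c*s)
            out[k] = s
            k += 1
    return out
```

As the discussion above anticipates, the acceptance probability E[exp(−cs)] decays as c grows, which is why a different instrumental family is needed for very extreme observations.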
Returning to the iterations of the overall sampler, we calculate D_{τ_i}(X_i − μ)(X_i − μ)^T D_{τ_i} at the end of each cycle through the p nodes, and then take the average over iterations. This solves the problem of carrying out one E-step, and we obtain the following stochastic penalized EM algorithm, which we call the Monte Carlo t*lasso (or MCt*lasso for short):
E-step:

Given current estimates μ̂^(t) and Ψ̂^(t), compute the conditional expectation of D_{τ_i}(X_i − μ)(X_i − μ)^T D_{τ_i} by averaging the matrices obtained in some large number of Gibbs sampler iterations, as described above.

M-step:

Calculate the updated estimates μ̂^(t+1) and Ψ̂^(t+1). Use these and the averaged matrices to compute the quantity to be plugged into the trace term in (11). Maximize the resulting penalized log-likelihood function using the glasso.
5.3 Variational approximation
The above Monte Carlo procedure loses much of the computational efficiency of the classical tlasso from Section 4, however, and can be prohibitively expensive for large p. For large problems, we turn instead to variational approximations of the conditional density of the divisor vector τ = (τ_1, ..., τ_p) given the observed vector X.

The variational approach proceeds by approximating the conditional density by a factoring distribution. In our context, however, it is easier to approximate the joint density instead. The first factor, the density of τ, is already in product form because we are assuming the individual divisors τ_j to be independent in the model formulation, and the second factor is the density of the multivariate normal distribution N(0, Ψ) of the latent vector Y. We approximate this normal distribution by a member of the set of multivariate normal distributions with diagonal covariance matrix. Application of this naive mean field procedure, that is, choosing a distribution by minimizing Kullback–Leibler divergence, leads to the approximating distribution

(17)   N(0, Δ^{-1}),

where Δ is the diagonal matrix with the same diagonal elements as Ψ^{-1} [Wainwright and Jordan (2008), Chapter 5]. Writing q for the density of the distribution in (17), our approximation thus has the fully factoring form. The resulting conditional distribution of τ given X = x also factors as

q(τ | x) = q_1(τ_1 | x_1) · ... · q_p(τ_p | x_p),

where q_j is the density of a Gamma distribution, with its parameters corresponding to the quantities in (13); under the diagonal approximation the cross term b vanishes.

In conclusion, the variational E-step consists of calculating, for each observation i, the expectations of τ_ij and √τ_ij under the approximating Gamma distributions. These values are then substituted into (11). The M-step is the same as in the MCt*lasso.

The effect of the variational approximation is that the weight for node j in observation i is based solely on the squared deviation (x_ij − μ_j)² from the mean, judged relative to the conditional variance 1/(Ψ^{-1})_{jj}. For a given deviation from the mean, the larger the conditional variance of the node, the less extreme the deviation is deemed, and the less that node is downweighted in that observation. But unlike in the MCt*lasso, no consideration is given to the deviation from the conditional mean of the node in question given the rest. Some relevant information is therefore not being used, but in our simulations the effect was not noticeable.

The resulting variational t*lasso is only slightly more expensive than the tlasso and, despite the relatively crude approximation in the variational E-step, performs well compared with the MCt*lasso. Because of this, we will use exclusively the variational t*lasso (below simply t*lasso) when considering the alternative model in the simulations in the next section.
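For illustration, the per-coordinate weights of the variational E-step can be sketched as follows; the formula assumes the univariate analogue of (5) with the conditional variance 1/(Ψ^{-1})_{jj} as the scale, which is our reading of the approximation in (17):

```python
import numpy as np

def variational_weights(X, mu, Theta, nu):
    """Per-coordinate E-step weights under the naive mean-field approximation.

    Each latent normal coordinate is approximated as N(0, 1/Theta_jj), so the
    weight for node j in observation i is the univariate version of (5):
    tau_ij = (nu + 1) / (nu + Theta_jj * (x_ij - mu_j)**2).
    Deviations are judged relative to the conditional variance 1/Theta_jj,
    coordinate by coordinate rather than for the whole observation at once.
    """
    d2 = (X - mu) ** 2
    return (nu + 1.0) / (nu + np.diag(Theta) * d2)
```

The maximal weight (ν + 1)/ν is attained at zero deviation, and any single contaminated coordinate is downweighted without affecting the remaining coordinates of the same observation.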
6 Simulation results
6.1 Procedure
We used simulated data to compare the three procedures glasso, tlasso and t*lasso as follows. We generated a random sparse inverse covariance (or dispersion) matrix Θ according to the following procedure:

(a) Choose each lower-triangular element of Θ independently to be −1, 0 or 1, with 0 receiving the largest probability so that the matrix is sparse.

(b) For i < j, set θ_ij = θ_ji.

(c) Define θ_ii = a_i + 1, where a_i is the number of nonzero off-diagonal elements in the ith row of Θ.
The final step ensures a strictly diagonally dominant, and thus positive definite, matrix. To strengthen the partial correlations, we reduced the diagonal elements by a common factor. We made this factor as large as possible while maintaining positive definiteness and stability for inversion. For these particular matrices, fixing a small minimum eigenvalue worked well.

We then generated observations from the corresponding multivariate normal distribution and ran each of the three procedures with a range of values for the one-norm tuning parameter λ. To compare how well the competing methods recovered the true edges, we drew ROC curves. We ran this whole process 250 times and then repeated the entire computation, drawing data from classical t- and then alternative t*-distributions.
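Steps (a)–(c) of the matrix generation can be sketched as follows; the sparsity probability and the +1 diagonal margin are illustrative stand-ins, not the exact constants used in the simulations:

```python
import numpy as np

def random_sparse_precision(p, prob_nonzero=0.2, rng=None):
    """Generate a sparse, strictly diagonally dominant (hence positive
    definite) inverse covariance matrix following steps (a)-(c)."""
    rng = np.random.default_rng(rng)
    Theta = np.zeros((p, p))
    # (a) lower-triangular elements: -1, 0 or 1, mostly 0
    for i in range(p):
        for j in range(i):
            if rng.random() < prob_nonzero:
                Theta[i, j] = rng.choice([-1.0, 1.0])
    # (b) symmetrize
    Theta = Theta + Theta.T
    # (c) diagonal exceeds the number of nonzero off-diagonal entries per row
    nonzeros = (Theta != 0).sum(axis=1)
    np.fill_diagonal(Theta, nonzeros + 1.0)
    return Theta
```

Since each diagonal entry strictly exceeds the sum of the absolute off-diagonal entries in its row, the matrix is positive definite by diagonal dominance, as the text notes.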
Simulating from t-distributions produces extreme observations, but a more realistic setting might be one in which normal data is contaminated in some fashion. For instance, consider broken probes or misread observations in a large gene expression microarray. Suppose the contaminated data are not so extreme as to be manually screened or otherwise identified as obvious outliers. To simulate this phenomenon, we generate normal data as above, but randomly contaminate a small percentage of the values with data generated from independent univariate normal random variables whose variance is a multiple of the largest diagonal element of the covariance matrix. These contaminated values will be similar in magnitude to the tail of the original distribution and therefore difficult to identify.
Finally, we would like to compare our newly developed procedures with simpler approaches to robust inference. There are many ways to obtain robust estimates of the covariance matrix, but these usually require n > p. Instead we obtain a robust estimate of the marginal covariances and variances using the procedure of Kalisch and Bühlmann (2008). Since this is not guaranteed to result in a positive definite matrix, we add a constant c to the diagonal elements of the matrix, where c is the minimum constant necessary to ensure that the resulting matrix is nonnegative definite. We then use this robust estimate of the covariance matrix as input into the glasso and refer to this procedure as the robust glasso.
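The diagonal correction used for the robust glasso amounts to shifting the matrix by its most negative eigenvalue; a small sketch:

```python
import numpy as np

def make_nonneg_definite(S):
    """Shift the diagonal of a symmetric matrix by the minimal constant c
    ensuring nonnegative definiteness: c = max(0, -lambda_min(S))."""
    lam_min = np.linalg.eigvalsh(S).min()
    c = max(0.0, -lam_min)
    return S + c * np.eye(S.shape[0]), c
```

Adding c·I raises every eigenvalue by exactly c, so the smallest eigenvalue of the corrected matrix is max(λ_min, 0) and no larger shift is needed.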
6.2 Results
Our tlasso and t*lasso are computationally more expensive than the glasso, since they call the glasso at each M-step. But in our simulations, the algorithms converge quickly. If we run through multiple increasing values of the tuning parameter λ for the one-norm penalty, it may take a handful of EM iterations for the initial small value of λ, but only 2 or 3 iterations for later values, as we can "warm start" at the previous output. But even in the initial run, two iterations typically lead to a drastic improvement (in the likelihood) over the glasso.
The only caveat is that the function being maximized by the tlasso methods is not guaranteed to be unimodal. We thus started the algorithms in several places, and let them run for longer than probably necessary in practice. We did not observe drastically different results from different starting places. Nonetheless, since we are not guaranteed to find a global maximum, the statistical performances of the tlasso and t*lasso may, in principle, be understated here (and, of course, the computational efficiency overstated).
In the worst case scenario for our procedures relative to the glasso—when the data is normal and we assume t-distributions with few degrees of freedom—almost no statistical efficiency is lost. In the numerous simulations we have run using normal data, the tlasso and glasso do an essentially equally good job of recovering the true graph (see Figure 4). The t*lasso performs surprisingly well at small to moderate false discovery rates. The robust glasso is based on a less efficient estimator and does not perform as well as the other procedures.

For data generated from a classical t-distribution, the tlasso provides drastic improvement over the glasso at the low false positive rates that are of practical interest. The assumed normality and the occasional extreme observation lead to numerous false positives when using the glasso. Therefore, there is very little computational—and little or no statistical—downside to assuming t-distributions, but significant statistical upside. Interestingly, the t*lasso performs about as well as the tlasso. The robust glasso outperforms the purely Gaussian procedure at low false positive rates, since it is less susceptible to the most extreme observations.

In the third case, with data generated from the alternative t*-distribution, only the t*lasso is able to recover useful information without substantial noise. The occasional large values are too extreme for the normal model to explain, and downweighting entire observations, as is done by the tlasso, discards too much information when there are extreme values scattered throughout the data. The robust glasso offers only a small improvement over the glasso.

With the contaminated data, the t*lasso does not perform as well as it does with t* data. The extreme values are not downweighted as much and, thus, the signals are noisier. It still performs far better, however, than either of the other methods, and is able to recover valuable information in a case where manual screening of outliers would be very difficult. The robust glasso does not perform as well as the t*lasso, but offers a clear improvement over the glasso and might be a useful alternative.
6.3 Notes on simulation
The simulations show that the tlasso performs very similarly to the glasso even with normal data. While one would expect a model based on the t-distribution to fare better with normal data than a normal model would with t data, the fact that there is almost no statistical loss from the model misspecification is at first a bit surprising. The similarity of the results can be explained, however, by comparing the two procedures. In effect, the only difference is that the tlasso inputs a weighted sample covariance matrix into the glasso procedure; one can then think of the glasso as the tlasso with all weights set to one.
As noted in Section 3.2, these weights are the conditional expectations of the divisors τ_i, which are, from equation (5),

(18)   τ̂_i = (ν̂ + p) / (ν̂ + δ_i),   where δ_i = (X_i − μ)^T Ψ^{-1} (X_i − μ)

and ν̂ is our estimate or assumption of the unknown degrees of freedom. If X ~ t_{p,ν}(μ, Ψ), then δ_i/p is distributed according to the F(p, ν) distribution [Kotz and Nadarajah (2004), Chapter 3]. Thus, starting with the true values of μ and Ψ, the variance of the inverse weights is

Var(1/τ̂_i) = Var(δ_i) / (ν̂ + p)².

For normal data (i.e., ν = ∞), δ_i follows a χ²_p distribution, so the variance is 2p/(ν̂ + p)² and goes to 0 very quickly as p gets large, no matter the assumed value ν̂. If our current estimate of Ψ is reasonably close to the true matrix, then the observations will likely have very similar weights and the weighted covariance matrix will be very close to the sample covariance matrix. For t data, the above variance tends to a positive constant for large p; so no matter how many variables we have, the distribution of the inverse weights will have positive variance and the tlasso and glasso estimates are less likely to agree.
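The normal-data case of this calculation is easy to package: with δ_i ~ χ²_p we have Var(δ_i) = 2p, so the inverse weights have variance 2p/(ν̂ + p)², which vanishes as p grows:

```python
def inverse_weight_variance_normal(p, nu_assumed):
    """Variance of the inverse weight (nu + delta)/(nu + p) for normal data,
    where delta follows a chi-squared distribution with p degrees of freedom
    (so Var(delta) = 2p)."""
    return 2.0 * p / (nu_assumed + p) ** 2
```

For ν̂ = 3, for instance, the variance is about 0.12 at p = 10 but only about 0.002 at p = 1000, so in high dimensions the weights are nearly constant and the tlasso essentially reduces to the glasso.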
7 Gene expression data
7.1 Galactose utilization
We consider data from microarray experiments with yeast strains [Gasch et al. (2000)]. As in Drton and Richardson (2008), we limit this illustration to 8 genes involved in galactose utilization. An assumption of normality is brought into question, in particular, by the fact that in 11 out of 136 experiments with data for all 8 genes, the measurements for 4 of the genes were abnormally large negative values. In order to assess the impact of this handful of outliers, we run each algorithm, adjusting the penalty term such that a graph with a given number of edges is inferred. Somewhat arbitrarily, we focus on the top 9 edges. We do this once with all 136 experiments and then again excluding the potential outliers.
As seen in Figure 5, the glasso infers very different graphs in the two scenarios, with only 3 edges in common. When the "outliers" are included, the glasso estimate in Figure 5(a) has the 4 nodes in question fully connected; when they are excluded, no edges among the 4 nodes are inferred. The tlasso does not exhibit this extreme behavior. As seen in Figure 5(b), it recovers almost the same graph in each case (7 out of 9 edges shared). When run with all the data, the estimated divisor τ̂_i is very small for each of the questionable observations compared with the average estimate. The graph in Figure 5(c) shows the results from the t*lasso, which performs just as well as the tlasso. The Monte Carlo version also recovered 7 edges in both graphs (not shown) and infers relationships similar to those found by the variational t*lasso.

Figure 6 illustrates the flexibility of the weighting schemes of the various procedures. Both versions of the t*lasso downweight the 11 potential outlier observations for the 4 nodes in question, but not for the other nodes. Thus, the alternative version is able to extract information from the "uncontaminated" part of the observations while downweighting the rest. In this particular case, with 125 other observations, downweighting the outliers is of primary importance, and, thus, the increased flexibility of the t*lasso over the tlasso does not make much of a difference in the inferred graphs. This might not be the case with a higher contamination level.
7.2 Isoprenoid pathway
We next consider gene expression data for the isoprenoid pathways of Arabidopsis thaliana discussed in Wille et al. (2004). Gene expressions were measured in 118 Affymetrix microarrays for 39 genes. While the data set described in the above section had clear deviations from normality, the data described in this section have no obvious deviations that stand out in exploratory plots.

Two approaches were considered in Wille et al. (2004). The first (GGM1) fit a Gaussian graphical model using BIC and backward selection to obtain a network with 178 edges. This number was deemed too large for interpretation, and the authors considered instead only the 31 edges found in a high percentage of bootstrapped samples. The second approach (GGM2) tests the conditional independence of each pair of genes given a third gene. An edge is drawn only if a test of conditional independence is rejected for every other gene in the network. This approach is advocated in the paper and appears to find a network with better biological interpretation. The graph is shown in Figure 7, where shaded nodes indicate the so-called MEP pathway.

Our approach is modeled after GGM1. We used the t*lasso and increasing values of λ to find a path of models to test. For each chosen model, we ran the t*lasso again, but this time without penalty on the allowed edges. Since the likelihood is unavailable, we use leave-one-out cross-validation to find the model with the lowest mean squared prediction error. Since the exact conditionals of the alternative t-distribution are not available in explicit form, we perform the cross-validation as follows:

(a) Estimate the model parameters using all but one observation.

(b) In the remaining observation, estimate the values of the latent normal variables for all but one of the coordinates, in the same manner as in the variational E-step of Section 5.3.

(c) Predict the remaining normal value.

(d) Scale the predicted normal value by the expectation of the corresponding divisor term.
We remark that we also experimented with leaving out a larger fraction of the observations as suggested in the work of Shao (1993), but this led to similar conclusions in the present example.
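In simplified form, the leave-one-out loop can be sketched as follows. To stay self-contained, this sketch replaces the latent-variable estimation and divisor rescaling with a plain Gaussian conditional-mean prediction from an unpenalized, ridge-stabilized precision matrix estimate; it illustrates the structure of the cross-validation, not the alternative-t computations themselves, and the function name and ridge term are our own choices:

```python
import numpy as np

def loo_prediction_error(X, ridge=1e-3):
    """Leave-one-out CV error for predicting each coordinate of the
    held-out row from its remaining coordinates via the Gaussian
    conditional mean implied by an estimated concentration matrix.
    (Simplified sketch: the variational E-step for the latent normal
    variables is replaced here by a plain Gaussian conditional.)"""
    n, p = X.shape
    errs = []
    for i in range(n):
        train = np.delete(X, i, axis=0)
        mu = train.mean(axis=0)
        S = np.cov(train, rowvar=False) + ridge * np.eye(p)
        K = np.linalg.inv(S)  # estimated concentration matrix
        x = X[i]
        for v in range(p):
            rest = np.delete(np.arange(p), v)
            # conditional mean of coordinate v given the other coordinates
            pred = mu[v] - (K[v, rest] @ (x[rest] - mu[rest])) / K[v, v]
            errs.append((x[v] - pred) ** 2)
    return np.mean(errs)
```

The penalty parameter minimizing this prediction error over the model path would then be selected.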
The cross-validation procedure gave a network with 122 edges. To reduce to the graph size found by GGM2, we took 500 bootstrapped samples of the data, fixing the penalty parameter found in cross-validation, and only included those edges found in a sufficiently large fraction of the samples. For comparison, we also ran the above procedure using the glasso, adjusting the edge-inclusion threshold to obtain a graph of the same size.
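The bootstrap step can be sketched generically. Here `fit_edges` is a hypothetical callback standing in for the t*lasso (or glasso) fit with the penalty parameter held fixed, and both function names are our own:

```python
import numpy as np

def bootstrap_edge_frequencies(X, fit_edges, n_boot=100, seed=0):
    """Refit a (fixed-penalty) edge estimator on bootstrap resamples of
    the rows of X and record the fraction of resamples in which each
    edge appears.  `fit_edges` maps a data matrix to a set of edges."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = {}
    for _ in range(n_boot):
        sample = X[rng.integers(0, n, size=n)]  # resample rows with replacement
        for e in fit_edges(sample):
            counts[e] = counts.get(e, 0) + 1
    return {e: c / n_boot for e, c in counts.items()}

def threshold_edges(freqs, level):
    """Keep only edges appearing in more than the given fraction of resamples."""
    return {e for e, f in freqs.items() if f > level}
```

Raising the threshold shrinks the graph, which is how a target graph size can be matched across procedures.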
We believe our procedure infers a graph that compares favorably (in terms of biological interpretation) with that found by GGM2. Like GGM2, we find a connection between AACT2 and the group MK, MPDC1 and FPPS2; GGM1 found AACT2 to be disconnected from the rest of the graph despite its high correlation with these three genes. In the MEP pathway, our approach and GGM2 find similar structure; compare Figures 7 and 8.
While our approach finds the key relationships identified in Wille et al. (2004), it achieves this with fewer “cross-talk” edges between the two pathways. The authors discuss plausible interpretations for such interactions between the pathways, but a graph with less cross-talk might be closer to the scientists’ original expectation (Figures 7 and 8). It is worth noting that the glasso procedure performs better than GGM1, with edge inclusion being far less sensitive to the particular bootstrapped sample. The glasso also finds the key relationships of GGM2. We also ran the tlasso, which gave results similar to the glasso; we do not show these results here.
8 Discussion
Our proposed tlasso and t*lasso algorithms are simple and effective methods for robust inference in graphical models. Only slightly more computationally expensive than the glasso, they can offer great gains in statistical efficiency. The alternative t distribution is more flexible than the classical one and is generally preferred. We find that the simple variational E-step is an efficient way to estimate the graph in the alternative case, but we also explored more sophisticated Monte Carlo approximations.
We assumed fixed degrees of freedom in our various tlasso and t*lasso runs. As suggested in prior work on t distribution models, the degrees of freedom can instead be estimated efficiently by a line search based on the observed log-likelihood function in the classical model.
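For the classical model such a line search can be sketched directly, since the observed log-likelihood of the multivariate t distribution is available in closed form. For simplicity, the sketch below holds the location and scale matrix fixed and searches a grid of candidate degrees of freedom; in practice these parameters would be re-estimated at each candidate value, for example within EM. The function names and grid are our own:

```python
import numpy as np
from math import lgamma, log, pi

def t_loglik(X, mu, sigma, nu):
    """Observed log-likelihood of n i.i.d. draws from the classical
    multivariate t distribution t_p(mu, Sigma, nu), with mu and Sigma
    held fixed."""
    n, p = X.shape
    sigma_inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)
    diff = X - mu
    # squared Mahalanobis distance of each observation
    delta = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
    const = (lgamma((nu + p) / 2) - lgamma(nu / 2)
             - 0.5 * p * log(nu * pi) - 0.5 * logdet)
    return n * const - 0.5 * (nu + p) * np.sum(np.log1p(delta / nu))

def line_search_nu(X, mu, sigma, grid=(1, 2, 3, 4, 5, 7, 10, 20, 50)):
    """Simple grid-based line search over the degrees of freedom."""
    return max(grid, key=lambda nu: t_loglik(X, mu, sigma, nu))
```

Because the likelihood is one-dimensional in the degrees of freedom once the other parameters are profiled out, a coarse grid followed by local refinement is typically sufficient.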
In the alternative model, the choice of the degrees of freedom puts an explicit upper bound on the maximum correlation between two variables, the upper bound increasing quickly with the degrees of freedom (see Figure 9). This makes inference of the degrees of freedom potentially more relevant than in the classical model, as an alternative model with small degrees of freedom might not be a good fit for highly correlated variables. A line search based on the hidden log-likelihood function can be employed for this selection. For further flexibility, we may also allow the degrees of freedom to vary with each node. That is, we could let the divisors be independent with possibly different degrees of freedom for each coordinate. This leads to similar conditionals in the Gibbs sampler, and the resulting procedure is thus no more complicated. Nevertheless, for the purposes of graph estimation, our experience and the related literature suggest that not much is lost by considering only a few small values for the degrees of freedom. For instance, running the t*lasso procedure in Section 7.2 with a different small value for the degrees of freedom produces a very similar result with one additional cross-talk edge.
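To illustrate, if each divisor follows the Gamma(nu/2, nu/2) distribution used to construct the t models, a short calculation of our own gives the bound |corr(X_u, X_v)| <= (E[tau^{-1/2}])^2 / E[tau^{-1}] for nu > 2 (the variances are infinite otherwise), which can be evaluated as:

```python
import math

def max_abs_correlation(nu):
    """Upper bound on |corr(X_u, X_v)| in the alternative model with
    independent tau ~ Gamma(nu/2, rate=nu/2) divisors (our derivation,
    not a formula quoted from the text):
        bound = (E[tau^{-1/2}])^2 / E[tau^{-1}].
    Requires nu > 2 so that the variances are finite."""
    # E[tau^{-1/2}] = sqrt(nu/2) * Gamma((nu-1)/2) / Gamma(nu/2), in logs
    log_e_inv_sqrt = (0.5 * math.log(nu / 2)
                      + math.lgamma((nu - 1) / 2) - math.lgamma(nu / 2))
    e_inv = nu / (nu - 2)  # E[tau^{-1}]
    return math.exp(2 * log_e_inv_sqrt) / e_inv
```

Under this derivation, small degrees of freedom cap the attainable correlation well below 1, and the bound rises toward 1 as the degrees of freedom grow, matching the qualitative behavior described above.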
In the last section we used cross-validation to choose the one-norm tuning parameter. The likelihood is not available explicitly for the alternative t distribution, so we cannot easily use information criteria for the t*lasso. Cross-validation often tends to pick more edges than is desirable, however, when the goal is inference of the graph and not optimal prediction. An interesting but potentially difficult problem for future research would be to develop rules for choosing the tuning parameter that control an edge inclusion error rate; compare Banerjee, El Ghaoui and d’Aspremont (2008) and Meinshausen and Bühlmann (2006).
Throughout the paper, we have penalized all the elements of the concentration matrix. One alternative is to remove the penalty from the diagonal elements, since we expect all of these to be nonzero. This leads to smaller estimated partial correlations, and we found it to result in less stable behavior of the tlasso, in the sense of the number of edges decreasing rather suddenly as the penalty parameter increases.
Finally, we remark that other normal scale-mixture models could be treated in a similar fashion as the t distribution models we considered in this paper. However, the use of t distributions is particularly convenient in that it is rather robust to various types of misspecification, involves only the choice of the degrees of freedom parameter for the Gamma-distributed divisors, and maintains good efficiency when the data are Gaussian.
Appendix
Proof of Proposition 1. According to standard graphical model theory [Lauritzen (1996)], it suffices to show that X_u and X_v are conditionally uncorrelated given X_C, where C = V \ {u, v}. Partition X into X_{uv} = (X_u, X_v) and X_C, and recall that, conditionally on the divisor τ, the vector X is normal with covariance matrix Σ/τ. For a given value of τ,

E[X_u | X_C, τ] = μ_u + Σ_{uC} Σ_{CC}^{-1} (X_C − μ_C),

which does not depend on τ, and

Cov(X_u, X_v | X_C, τ) = τ^{-1} (Σ_{uv} − Σ_{uC} Σ_{CC}^{-1} Σ_{Cv}).

Since K_{uv} = 0,

Σ_{uv} − Σ_{uC} Σ_{CC}^{-1} Σ_{Cv} = 0,

so that Cov(X_u, X_v | X_C, τ) = 0 for any value of τ. Therefore,

Cov(X_u, X_v | X_C) = E[Cov(X_u, X_v | X_C, τ) | X_C] + Cov(E[X_u | X_C, τ], E[X_v | X_C, τ] | X_C) = 0,

where the second term vanishes because the conditional means do not depend on τ, which implies that X_u and X_v are conditionally uncorrelated given X_C.
References
Banerjee, O., El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516.
Drton, M. and Perlman, M. D. (2007). Multiple testing and error control in Gaussian graphical model selection. Statist. Sci. 22 430–449.
Drton, M. and Richardson, T. S. (2008). Graphical methods for efficient likelihood inference in Gaussian covariance models. J. Mach. Learn. Res. 9 893–914.
Friedman, N. (1997). Learning belief networks in the presence of missing values and hidden variables. In Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. and Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell 11 4241–4257.
Kalisch, M. and Bühlmann, P. (2008). Robustification of the PC-algorithm for directed acyclic graphs. J. Comput. Graph. Statist. 17 773–789.
Kotz, S. and Nadarajah, S. (2004). Multivariate t Distributions and Their Applications. Cambridge Univ. Press, Cambridge.
Lauritzen, S. L. (1996). Graphical Models. Oxford Univ. Press, New York.
Liu, C. and Rubin, D. B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statist. Sinica 5 19–39.
McLachlan, G. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
Miyamura, M. and Kano, Y. (2006). Robust Gaussian graphical modeling. J. Multivariate Anal. 97 1525–1550.
Ravikumar, P., Raskutti, G., Wainwright, M. J. and Yu, B. (2009). Model selection in Gaussian graphical models: High-dimensional consistency of regularized MLE. In Advances in Neural Information Processing Systems (NIPS) (D. Koller, D. Schuurmans, Y. Bengio and L. Bottou, eds.) 21 1329–1336.
Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494.
Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1 1–305.
Wille, A., Zimmermann, P., Vranova, E., Furholz, A., Laule, O., Bleuler, S., Hennig, L., Prelic, A., von Rohr, P., Thiele, L., Zitzler, E., Gruissem, W. and Bühlmann, P. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5 R92.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35.