# Robust graphical modeling of gene networks using classical and alternative T-distributions

Graphical Gaussian models have proven to be useful tools for exploring network structures based on multivariate data. Applications to studies of gene expression have generated substantial interest in these models, and resulting recent progress includes the development of fitting methodology involving penalization of the likelihood function. In this paper we advocate the use of multivariate t-distributions for more robust inference of graphs. In particular, we demonstrate that penalized likelihood inference combined with an application of the EM algorithm provides a computationally efficient approach to model selection in the t-distribution case. We consider two versions of multivariate t-distributions, one of which requires the use of approximation techniques. For this distribution, we describe a Markov chain Monte Carlo EM algorithm based on a Gibbs sampler as well as a simple variational approximation that makes the resulting method feasible in large problems.

## Authors

Michael A. Finegold, et al.


## 1 Introduction

Graphical Gaussian models have attracted a lot of recent interest. In these models an observed random vector $Y = (Y_1, \dots, Y_p)$ is assumed to follow a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ the positive definite covariance matrix. Each model is associated with an undirected graph $G = (V, E)$ with vertex set $V = \{1, \dots, p\}$, and defined by requiring that for each nonedge $(j, k) \notin E$, the variables $Y_j$ and $Y_k$ are conditionally independent given all the remaining variables $Y_{\setminus\{j,k\}}$. Here, $\setminus\{j,k\}$ denotes the complement $V \setminus \{j,k\}$. Such pairwise conditional independence holds if and only if $(\Sigma^{-1})_{jk} = 0$; see Lauritzen (1996) for this fact and general background on graphical models. Therefore, inferring the graph corresponds to inferring the nonzero elements of $\Sigma^{-1}$.

Classical solutions to the model selection problem include constraint-based approaches that test the model-defining conditional independence constraints, and score-based searches that optimize a model score over a set of graphs. A review of this work can be found in Drton and Perlman (2007). Recently, however, penalized likelihood approaches based on the one-norm of the concentration matrix have become increasingly popular. Meinshausen and Bühlmann (2006) proposed a method that uses lasso regressions of each variable $Y_j$ on the remaining variables $Y_{\setminus j}$. In subsequent work, Yuan and Lin (2007) and Banerjee, El Ghaoui and d’Aspremont (2008) discuss the computation of the exact solution to the convex optimization problem arising from the likelihood penalization. Finally, Friedman, Hastie and Tibshirani (2008) developed the graphical lasso (glasso), which is a computationally efficient algorithm that maximizes the penalized log-likelihood function through coordinate-descent. The theory that accompanies these algorithmic developments supplies high-dimensional consistency properties under assumptions of graph sparsity; see, for example, Ravikumar et al. (2009).

Inference of a graph can be significantly impacted, however, by deviations from normality. In particular, contamination of a handful of variables in a few experiments can lead to a drastically wrong graph. Applied work thus often proceeds by identifying and removing such experiments before data analysis, but such outlier screening can become difficult with large data sets. More importantly, removing entire experiments as outliers may discard useful information from the uncontaminated variables they may contain.

The existing literature on robust inference in graphical models is fairly limited. One line of work concerns constraint-based approaches and adopts robustified statistical tests [Kalisch and Bühlmann (2008)]. An approach for fitting the model associated with a given graph using a robustified likelihood function is described in Miyamura and Kano (2006). In some cases simple transformations of the data may be effective at minimizing the effect of outliers or contaminated data on a small scale. A normal quantile transformation, in particular, appears to be effective in many cases.

In this paper we extend the scope of robust inference by providing a tool for robust model selection that can be applied with highly multivariate data. We build upon the glasso of Friedman, Hastie and Tibshirani (2008), but model the data using multivariate $t$-distributions. Using the EM algorithm, the tlasso methods we propose are only slightly less computationally efficient than the glasso but cope rather well with contaminated data.

The paper is organized as follows. In Section 2 we review maximization of the penalized Gaussian log-likelihood function using the glasso. In Section 3 we introduce the classical multivariate $t$-distribution and describe maximization of the (unpenalized) log-likelihood using the EM algorithm. In Section 4 we combine the two techniques into the tlasso to maximize the penalized log-likelihood in the multivariate $t$ case. In Section 5 we introduce an alternative multivariate $t$-distribution and describe how inference can be done using stochastic and variational EM. In Section 6 we compare the glasso to our $t$-based methods on simulated data. Finally, in Section 7 we analyze two different gene expression data sets using the competing methods. Our findings are summarized in Section 8.

## 2 Graphical Gaussian models and the graphical lasso

Suppose we observe a sample of independent random vectors $Y_1, \dots, Y_n$ that are distributed according to the multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$. Likelihood inference about the covariance matrix $\Sigma$ is based on the log-likelihood function

$$\ell(\Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log\det(\Sigma) - \frac{n}{2}\operatorname{tr}(S\Sigma^{-1}),$$

where the empirical covariance matrix

$$S = (s_{jk}) = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^T$$

is defined based on deviations from the sample mean $\bar{Y}$. Let $\Theta = (\theta_{jk}) = \Sigma^{-1}$ denote the ($p \times p$)-concentration matrix. In penalized likelihood methods a one-norm penalty is added to the log-likelihood function, which effectively performs model selection because the resulting estimates of $\Theta$ may have entries that are exactly zero. Omitting irrelevant factors and constants, we are led to the problem of maximizing the function

$$\log\det(\Theta) - \operatorname{tr}(S\Theta) - \rho\|\Theta\|_1 \qquad (1)$$

over the cone of positive definite matrices, where $\|\Theta\|_1$ is the sum of the absolute values of the entries of $\Theta$. The multiplier $\rho$ is a positive tuning parameter. Larger values of $\rho$ lead to more entries of $\Theta$ being estimated as zero. Cross-validation or information criteria can be used to tune $\rho$.

The glasso is an iterative method for solving the convex optimization problem with the objective function in (1). Its updates operate on the covariance matrix $\Sigma = \Theta^{-1}$. In each step one row (and column) of the symmetric matrix $\Sigma$ is updated based on a partial maximization of (1) in which all but the considered row (and column) of $\Sigma$ are held fixed. This partial maximization is solved via coordinate-descent as briefly reviewed next.

Partition off the last row and column of $\Sigma$ and $S$ as

$$\Sigma = \begin{pmatrix} \Sigma_{\setminus p,\setminus p} & \Sigma_{\setminus p,p} \\ \Sigma_{\setminus p,p}^T & \sigma_{pp} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{\setminus p,\setminus p} & S_{\setminus p,p} \\ S_{\setminus p,p}^T & s_{pp} \end{pmatrix}.$$

Then, as shown in Banerjee, El Ghaoui and d’Aspremont (2008), partially maximizing (1) with $\Sigma_{\setminus p,\setminus p}$ held fixed yields $\Sigma_{\setminus p,p} = \Sigma_{\setminus p,\setminus p}\hat{\beta}$, where $\hat{\beta}$ minimizes

$$\bigl\|(\Sigma_{\setminus p,\setminus p})^{1/2}\beta - (\Sigma_{\setminus p,\setminus p})^{-1/2}S_{\setminus p,p}\bigr\|^2 + \rho\|\beta\|_1$$

with respect to $\beta \in \mathbb{R}^{p-1}$. The glasso finds $\hat{\beta}$ by coordinate descent in each of the coordinates $j = 1, \dots, p-1$, using the updates

$$\beta_j^* = \frac{T\bigl(s_{jp} - \sum_{k \neq j} \sigma_{kj}\beta_k^*,\ \rho\bigr)}{\sigma_{jj}},$$

where $T(x, t) = \operatorname{sign}(x)(|x| - t)_+$ denotes soft-thresholding. The algorithm then cycles through the rows and columns of $\Sigma$ and $S$ until convergence. The diagonal elements are simply $\sigma_{pp} = s_{pp} + \rho$. See Friedman, Hastie and Tibshirani (2008) for more details on the method.

## 3 Graphical models based on the t-distribution

### 3.1 Classical multivariate t-distribution

The classical multivariate $t$-distribution $t_{p,\nu}(\mu, \Psi)$ on $\mathbb{R}^p$ has Lebesgue density

$$f_\nu(y; \mu, \Psi) = \frac{\Gamma((\nu+p)/2)\,|\Psi|^{-1/2}}{(\pi\nu)^{p/2}\,\Gamma(\nu/2)\,\bigl[1 + \delta_y(\mu, \Psi)/\nu\bigr]^{(\nu+p)/2}} \qquad (2)$$

with degrees of freedom $\nu > 0$ and $\delta_y(\mu, \Psi) = (y - \mu)^T\Psi^{-1}(y - \mu)$. The vector $\mu$ and the positive definite matrix $\Psi$ determine the first two moments of the distribution. If $Y \sim t_{p,\nu}(\mu, \Psi)$ with $\nu$ degrees of freedom, then the expectation is $E[Y] = \mu$ and the covariance matrix is $V[Y] = \nu\Psi/(\nu - 2)$. From here on we will always assume $\nu > 2$ for the covariance matrix to exist. For notational convenience and to illustrate the parallels with the Gaussian model, we define $\Theta = \Psi^{-1}$.

If $X \sim \mathcal{N}(0, \Psi)$ is a multivariate normal random vector independent of the Gamma-random variable $\tau \sim \Gamma(\nu/2, \nu/2)$, then $Y = \mu + X/\sqrt{\tau}$ is distributed according to $t_{p,\nu}(\mu, \Psi)$; see Kotz and Nadarajah (2004), Chapter 1. This scale-mixture representation, illustrated in Figure 1, allows for easy sampling. It also clarifies how the use of $t$-distributions leads to more robust inference because extreme observations can arise from small values of $\tau$. An additional useful fact is that the conditional distribution of $\tau$ given $Y$ is again a Gamma-distribution, namely,

$$(\tau \mid Y) \sim \Gamma\!\left(\frac{\nu + p}{2},\ \frac{\nu + \delta_Y(\mu, \Psi)}{2}\right). \qquad (3)$$
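The scale-mixture representation translates directly into a sampler; a minimal numpy sketch (function names are ours):

```python
import numpy as np

def rmvt(n, mu, Psi, nu, rng):
    """Draw n samples from the classical multivariate t: Y = mu + X / sqrt(tau),
    with X ~ N(0, Psi) and tau ~ Gamma(nu/2, rate=nu/2) independent."""
    p = len(mu)
    X = rng.multivariate_normal(np.zeros(p), Psi, size=n)
    tau = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    return mu + X / np.sqrt(tau)[:, None]

rng = np.random.default_rng(0)
Psi = np.array([[1.0, 0.5], [0.5, 2.0]])
# with nu = 10 the covariance of Y should be close to nu/(nu-2) * Psi = 1.25 * Psi
Y = rmvt(200_000, np.zeros(2), Psi, nu=10.0, rng=rng)
```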

Let $G = (V, E)$ be a graph with vertex set $V = \{1, \dots, p\}$. We define the associated graphical model for the $t$-distribution by requiring that $\theta_{jk} = 0$ for indices $j, k$ corresponding to a nonedge $(j, k) \notin E$. This mimics the Gaussian model in that zero constraints are imposed on the inverse of the covariance matrix. However, in a $t$-distribution this no longer corresponds to conditional independence, and the density $f_\nu$ does not factor according to the graph. The conditional dependence manifests itself, in particular, in conditional variances in that even if $\theta_{jk} = 0$,

$$V[Y_j \mid Y_{\setminus j}] \neq V[Y_j \mid Y_{\setminus\{j,k\}}].$$

For a simple illustration of this inequality, let $\Theta$ be a diagonal matrix. Then

$$V[Y_j \mid Y_{\setminus j}] = E[X_j^2/\tau \mid Y_{\setminus j}] = \frac{1}{\theta_{jj}}\cdot E[\tau^{-1} \mid Y_{\setminus j}] = \frac{1}{\theta_{jj}}\cdot\frac{\nu + \delta_{Y_{\setminus j}}(\mu_{\setminus j}, \Psi_{\setminus j,\setminus j})}{\nu + p - 3},$$

which can be shown by taking iterated conditional expectations, and using that

$$E[X_j^2 \mid Y_{\setminus j}, \tau] = E[X_j^2 \mid X_{\setminus j}, \tau] = V[X_j \mid X_{\setminus j}] = \frac{1}{\theta_{jj}}$$

and that $\tau$ given $Y_{\setminus j}$ has a Gamma-distribution; recall (3). Clearly, $V[Y_j \mid Y_{\setminus j}]$ depends on all $Y_k$, $k \neq j$.

Despite the lack of conditional independence, the following property still holds (proved in the Appendix).

###### Proposition 1

Let $X \sim \mathcal{N}(0, \Psi)$, where $(\Psi^{-1})_{jk} = 0$ for pairs of indices $(j, k)$ that correspond to nonedges in the graph $G$. Let $\tau$ be independent of $X$ and follow any distribution on the positive real numbers with $E[1/\tau] < \infty$, and define $Y = \mu + X/\sqrt{\tau}$. If two nodes $j$ and $k$ are separated by a set of nodes $C$ in $G$, then $Y_j$ and $Y_k$ are conditionally uncorrelated given $Y_C$.

The edges in the graph indicate the allowed conditional independencies in the latent Gaussian vector $X$. According to Proposition 1, however, we may also interpret the graph in terms of the observed variables $Y_j$. The zero conditional correlations entail that mean-square error optimal prediction of variable $Y_j$ can be based on the variables that correspond to neighbors of the node $j$ in the graph, which is a very appealing property.

### 3.2 EM algorithm for estimation

The lack of density factorization properties complicates likelihood inference with $t$-distributions. However, the EM algorithm provides a way to circumvent this issue. Equipped with the normal-Gamma construction, we treat $\tau$ as a hidden variable and use that the conditional distribution of $Y$ given $\tau$ is $\mathcal{N}(\mu, \Psi/\tau)$. We now outline the EM algorithm for the $t$-distribution assuming the degrees of freedom $\nu$ to be known. If desired, $\nu$ could also be estimated in a line search that is best based on the actual $t$-likelihood [Liu and Rubin (1995)].

Consider an $n$-sample $Y_1, \dots, Y_n$ drawn from $t_{p,\nu}(\mu, \Psi)$. Let $\tau_1, \dots, \tau_n$ be an associated sequence of hidden Gamma-random variables. Observation of the $\tau_i$ would lead to the following complete-data log-likelihood function for $\mu$ and $\Theta = \Psi^{-1}$:

$$\ell_{\mathrm{hid}}(\mu, \Theta \mid Y, \tau) \propto \frac{n}{2}\log\det(\Theta) - \frac{1}{2}\operatorname{tr}\Bigl(\Theta\sum_{i=1}^n \tau_i Y_iY_i^T\Bigr) + \mu^T\Theta\sum_{i=1}^n \tau_i Y_i - \frac{1}{2}\mu^T\Theta\mu\sum_{i=1}^n \tau_i, \qquad (4)$$

where, with some abuse, the symbol $\propto$ indicates that irrelevant additive constants are omitted. The complete-data sufficient statistics

$$S_\tau = \sum_{i=1}^n \tau_i, \qquad S_{\tau Y} = \sum_{i=1}^n \tau_i Y_i, \qquad S_{\tau YY} = \sum_{i=1}^n \tau_i Y_iY_i^T$$

are thus linear in $\tau = (\tau_1, \dots, \tau_n)$. We obtain the following EM algorithm for computing the maximum likelihood estimates of $\mu$ and $\Psi$:

E-step:

The E-step is simple because

$$E[\tau \mid Y] = \frac{\nu + p}{\nu + \delta_Y(\mu, \Psi)}. \qquad (5)$$

Given current estimates $\mu^{(t)}$ and $\Psi^{(t)}$, we compute in the $(t+1)$st iteration

$$\tau_i^{(t+1)} = \frac{\nu + p}{\nu + \delta_{Y_i}(\mu^{(t)}, \Psi^{(t)})}.$$

M-step:

Calculate the updated estimates

$$\mu^{(t+1)} = \frac{\sum_{i=1}^n \tau_i^{(t+1)}Y_i}{\sum_{i=1}^n \tau_i^{(t+1)}}, \qquad (6)$$

$$\Psi^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \tau_i^{(t+1)}\bigl[Y_i - \mu^{(t+1)}\bigr]\bigl[Y_i - \mu^{(t+1)}\bigr]^T. \qquad (7)$$
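These E- and M-steps translate directly into code; a compact numpy sketch for the unpenalized case with $\nu$ fixed (function names are ours):

```python
import numpy as np

def t_em(Y, nu, n_iter=50):
    """EM for the MLE of (mu, Psi) in the classical multivariate t-distribution.
    E-step: tau_i = (nu + p) / (nu + delta_i), cf. eq. (5);
    M-step: weighted mean and weighted scatter, cf. eqs. (6)-(7)."""
    n, p = Y.shape
    mu = Y.mean(axis=0)
    Psi = np.cov(Y, rowvar=False)
    for _ in range(n_iter):
        R = Y - mu
        delta = np.einsum('ij,jk,ik->i', R, np.linalg.inv(Psi), R)  # Mahalanobis
        tau = (nu + p) / (nu + delta)          # E-step
        mu = tau @ Y / tau.sum()               # M-step, eq. (6)
        R = Y - mu
        Psi = (R * tau[:, None]).T @ R / n     # M-step, eq. (7)
    return mu, Psi

# data generated from the scale-mixture construction with mu = 1, Psi = I, nu = 5
rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=2000)
tau0 = rng.gamma(5.0 / 2.0, 2.0 / 5.0, size=2000)
Y = 1.0 + X / np.sqrt(tau0)[:, None]
mu_hat, Psi_hat = t_em(Y, nu=5.0)
```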

## 4 Penalized inference in t-distribution models

Model selection in graphical $t$-models can be performed, in principle, by any of the classical constraint- and score-based methods. In score-based searches through the set of all undirected graphs on $p$ nodes, however, each model would have to be refit using an iterative method such as the algorithm from Section 3.2. The penalized likelihood approach avoids this problem.

Like in the Gaussian case, we put a one-norm penalty on the elements of $\Theta = \Psi^{-1}$ and wish to maximize the penalized log-likelihood function

$$\ell_{\rho,\mathrm{obs}}(\mu, \Theta \mid Y) = \sum_{i=1}^n \log f_\nu(Y_i; \mu, \Theta^{-1}) - \rho\|\Theta\|_1, \qquad (8)$$

where $f_\nu$ is the $t$-density from (2). To achieve this, we will use a modified version of the EM algorithm taking into account the one-norm penalty.

We treat $\tau = (\tau_1, \dots, \tau_n)$ as missing data. In the E-step of our algorithm, we calculate the conditional expectation of the penalized complete-data log-likelihood

$$\ell_{\rho,\mathrm{hid}}(\mu, \Theta \mid Y, \tau) \propto \frac{n}{2}\log|\Theta| - \frac{n}{2}\operatorname{tr}\bigl(\Theta S_{\tau YY}(\mu)\bigr) - \rho\|\Theta\|_1 \qquad (9)$$

with

$$S_{\tau YY}(\mu) = \frac{1}{n}\sum_{i=1}^n \tau_i(Y_i - \mu)(Y_i - \mu)^T.$$

Since $\ell_{\rho,\mathrm{hid}}$ is again linear in $\tau$, the E-step takes the same form as in Section 3.2. Let $\mu^{(t)}$ and $\Theta^{(t)}$ be the estimates after the $t$th iteration, and $\tau^{(t+1)}$ the conditional expectation of $\tau$ calculated in the $(t+1)$st E-step. Then in the M-step of our algorithm we wish to maximize

$$\frac{n}{2}\log|\Theta| - \frac{n}{2}\operatorname{tr}\bigl(\Theta S_{\tau^{(t+1)}YY}(\mu)\bigr) - \rho\|\Theta\|_1$$

with respect to $\mu$ and $\Theta$. Differentiation with respect to $\mu$ yields $\mu^{(t+1)}$ from (6) for any value of $\Theta$. Therefore, $\Theta^{(t+1)}$ is found by maximizing

$$\frac{n}{2}\log|\Theta| - \frac{n}{2}\operatorname{tr}\bigl(\Theta S_{\tau^{(t+1)}YY}(\mu^{(t+1)})\bigr) - \rho\|\Theta\|_1. \qquad (10)$$

The quantity in (10), however, is exactly the objective function maximized by the glasso.

Iterating the E- and M-steps just described, we obtain what we call the tlasso algorithm. Since the one-norm penalty forces some elements of $\Theta^{(t+1)}$ exactly to zero, the tlasso performs model selection and parameter estimation in a way that is similar to structural EM algorithms [Friedman (1997)]. Convergence to a stationary point is guaranteed in the penalized version of the EM algorithm [McLachlan and Krishnan (1997), Chapter 1.6]; typically a local maximum is found. Note also that the penalized log-likelihood function being maximized is not concave, and so one finds oneself in the usual situation of not being able to give any guarantees about having obtained a global maximum.
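A sketch of the resulting tlasso iteration; we assume scikit-learn is available and use its `graphical_lasso` as the M-step solver (the original method calls the glasso of Friedman, Hastie and Tibshirani directly):

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def tlasso(Y, nu, rho, n_em=10):
    """Penalized EM sketch: the E-step computes weights tau_i; the M-step runs
    the glasso on the weighted scatter matrix S_{tau YY}(mu)."""
    n, p = Y.shape
    mu = Y.mean(axis=0)
    Theta = np.linalg.inv(np.cov(Y, rowvar=False) + 0.1 * np.eye(p))
    tau = np.ones(n)
    for _ in range(n_em):
        R = Y - mu
        delta = np.einsum('ij,jk,ik->i', R, Theta, R)
        tau = (nu + p) / (nu + delta)             # E-step
        mu = tau @ Y / tau.sum()                  # M-step, mean update
        R = Y - mu
        S = (R * tau[:, None]).T @ R / n          # weighted scatter matrix
        _, Theta = graphical_lasso(S, alpha=rho)  # M-step, glasso
    return mu, Theta, tau

rng = np.random.default_rng(3)
Y = rng.multivariate_normal(np.zeros(4), np.eye(4), size=100)
Y[0] += 12.0   # contaminate one entire observation
mu_hat, Theta_hat, tau = tlasso(Y, nu=3.0, rho=0.1)
```

The contaminated observation receives a far smaller weight than the clean ones, so it barely influences the weighted scatter matrix; this is the mechanism behind the robustness of the tlasso.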

## 5 Alternative model

### 5.1 Specification of the alternative t-model

The tlasso from Section 4 performs particularly well when a small fraction of the observations are contaminated (or otherwise extreme). In this case, these observations are downweighted in their entirety, and the gain from reducing the effect of contaminated nodes outweighs the loss from throwing away good data from other nodes. In high-dimensional data sets, however, the contamination, or other deviation from normality, may be in small parts of many observations. Downweighting entire observations may then no longer achieve the desired results. We will demonstrate this later in simulations (see the bottom panel of Figure 4).

To handle the above situation better, we consider an alternative extension of the univariate $t$-distribution, illustrated in Figure 2. Instead of one divisor $\tau$ per $p$-variate observation, we draw $p$ divisors $\tau_1, \dots, \tau_p$. For $j = 1, \dots, p$, let $\tau_j \sim \Gamma(\nu/2, \nu/2)$ be independent of each other and of $X \sim \mathcal{N}(0, \Psi)$. We then say that the random vector $Y$ with coordinates $Y_j = \mu_j + X_j/\sqrt{\tau_j}$ follows an alternative multivariate $t$-distribution; in symbols, $Y \sim t^*_{p,\nu}(\mu, \Psi)$.

Unlike for the classical multivariate $t$-distribution, the covariance matrix of $Y \sim t^*_{p,\nu}(\mu, \Psi)$ is no longer a constant multiple of $\Psi$ when $p > 1$. Clearly, the coordinate variances are still the same, namely,

$$V[Y_j] = \frac{\nu}{\nu - 2}\cdot\psi_{jj},$$

but the covariance between $Y_j$ and $Y_k$ with $j \neq k$ is now

$$\frac{\nu\,\Gamma((\nu - 1)/2)^2}{2\,\Gamma(\nu/2)^2}\cdot\psi_{jk} \leq \frac{\nu}{\nu - 2}\cdot\psi_{jk}.$$

The same matrix $\Psi$ thus implies smaller correlations (by the same constant multiple) in the $t^*$-distribution. This reduced dependence is not surprising in light of the fact that now different and independent divisors appear in the different coordinates. Despite the decrease in marginal correlations, the result of Proposition 1 does not hold for conditional correlations in the alternative model. That is, $\theta_{jk} = 0$ does not imply that $Y_j$ and $Y_k$ are conditionally uncorrelated given the remaining variables. Interpretation of the graph in the alternative model is thus limited to considering edges to represent the allowed conditional dependencies in the latent multivariate normal distribution.
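Both the construction of the alternative model and the covariance shrinkage just stated can be checked with a short numpy simulation (a common $\nu$ across coordinates is assumed):

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(4)
nu, n = 6.0, 400_000
Psi = np.array([[1.0, 0.5], [0.5, 1.0]])

# alternative t: one independent Gamma(nu/2, nu/2) divisor per coordinate
X = rng.multivariate_normal(np.zeros(2), Psi, size=n)
tau = rng.gamma(nu / 2.0, 2.0 / nu, size=(n, 2))
Y = X / np.sqrt(tau)

var_factor = nu / (nu - 2.0)                                   # variance multiplier
cov_factor = nu * gamma((nu - 1) / 2.0) ** 2 / (2.0 * gamma(nu / 2.0) ** 2)
# cov_factor < var_factor: the same Psi implies smaller correlations
# than in the classical multivariate t-distribution
```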

The following simulation confirms the result and illustrates the effect. We consider a trivariate $t^*$-distribution with

$$\Theta = \begin{bmatrix} 1 & 0 & -0.5 \\ 0 & 1 & -0.5 \\ -0.5 & -0.5 & 1 \end{bmatrix}$$

and draw independent samples until we have 500,000 observations with $Y_3 \approx y_3$, for each of several values $y_3$. The sample correlations of $Y_1$ and $Y_2$ given the varying values of $Y_3$ are shown in Figure 3.

### 5.2 Alternative tlasso

Inference in the alternative model presents some difficulties because the likelihood function is not available explicitly. The complete-data log-likelihood function, however, is obtained simply from the product of evaluations of Gamma-densities ($\tau_i$ now being a vector) and a multivariate normal density. We can thus implement an EM-type procedure if we are able to compute the conditional expectation of $\tau_i$ given $Y_i$. This time we treat the random variables $\tau_{i1}, \dots, \tau_{ip}$ as hidden for each observation $i = 1, \dots, n$. Unfortunately, the conditional expectation is intractable. It can be estimated, however, using Markov chain Monte Carlo.

The complete-data log-likelihood function is equal to

$$\ell^*_{\rho,\mathrm{hid}}(\mu, \Theta \mid Y, \tau) \propto \frac{n}{2}\log|\Theta| - \frac{n}{2}\operatorname{tr}\bigl(\Theta S^*_{\tau YY}(\mu)\bigr) - \rho\|\Theta\|_1, \qquad (11)$$

where

$$S^*_{\tau YY}(\mu) = \frac{1}{n}\sum_{i=1}^n D(\sqrt{\tau_i})(Y_i - \mu)(Y_i - \mu)^T D(\sqrt{\tau_i})$$

and $D(\sqrt{\tau_i})$ is the diagonal matrix with $\sqrt{\tau_{i1}}, \dots, \sqrt{\tau_{ip}}$ along the diagonal. The trace in (11) is linear in the entries of the matrix $S^*_{\tau YY}(\mu)$. A Gibbs sampler for estimating the conditional expectation of this matrix given $Y$ cycles through the coordinates $j = 1, \dots, p$ and draws, in each step, a value $\tau_{ij}$ from the conditional distribution of $\tau_{ij}$ given $(\tau_{i,\setminus j}, Y_i)$. This full conditional has density

$$f(\tau_{ij} \mid \tau_{i,\setminus j}, Y_i) = C(\alpha, \beta, \gamma)\cdot\tau_{ij}^{\alpha-1}\exp\bigl\{-\tau_{ij}\beta - \sqrt{\tau_{ij}}\,\gamma\bigr\} \qquad (12)$$

with

$$\alpha = \frac{\nu + 1}{2}, \qquad \beta = \frac{\nu + (Y_{ij} - \mu_j)^2\theta_{jj}}{2}, \qquad \gamma = (Y_{ij} - \mu_j)\,\Theta_{j,\setminus j}X_{i,\setminus j}, \qquad (13)$$

where $X_{i,\setminus j}$ has coordinates $X_{ik} = \sqrt{\tau_{ik}}(Y_{ik} - \mu_k)$, $k \neq j$, and normalizing constant $C(\alpha, \beta, \gamma)$. This constant can always be expressed using hypergeometric functions, but, as we detail below, much simpler formulas can be obtained for the small integer degrees of freedom that are of interest in practice. The simpler formulas are obtained by partial integration.

From $\beta$ and $\gamma$ in (13), form the ratio $\gamma' = \gamma/(2\sqrt{\beta})$. In order to sample from the distribution in (12), we may draw from

$$f_{\alpha,\gamma'}(t) = C(\alpha, \gamma')\cdot t^{\alpha-1}\exp\bigl\{-t - 2\sqrt{t}\,\gamma'\bigr\} \qquad (14)$$

and divide the result by $\beta$. For our default of $\nu = 3$, that is, $\alpha = 2$, we thus need to sample from

$$f_\gamma(t) = C(\gamma)\cdot t\exp\bigl\{-t - 2\sqrt{t}\,\gamma\bigr\}. \qquad (15)$$

Writing $\Phi$ for the cumulative distribution function of the standard normal distribution, the normalizing constant becomes

$$1/C(\gamma) = 1 + \gamma^2 - \gamma(2\gamma^2 + 3)\sqrt{\pi}\exp\{\gamma^2\}\bigl(1 - \Phi(\gamma\sqrt{2})\bigr). \qquad (16)$$

For $\gamma = 0$, the density $f_0$ is a $\Gamma(2, 1)$ density. For moderate $\gamma$, we are thus led to the following rejection sampling procedure to draw from $f_\gamma$.

Let $g_\delta$ be the density of a $\Gamma(2, \delta)$ distribution. Rejection sampling using the family of densities $g_\delta$ as instrumental densities proceeds by drawing a proposal $T \sim g_\delta$ and a uniform random variable $U$, and either accepting $T$ if $U \leq f_\gamma(T)/[M g_\delta(T)]$ or repeating the process until acceptance. Here, $M$ is a suitable multiplier such that $f_\gamma(t) \leq M g_\delta(t)$ for all $t > 0$.

An important ingredient to the rejection sampler is the parameter $\delta$, which we choose as follows. In the case $\delta < 1$, the instrumental density $g_\delta$ has a heavier tail than $f_\gamma$ regardless of the sign of $\gamma$. Focusing on the case $\gamma < 0$, we have that for a given $\delta < 1$ the smallest $M$ such that $f_\gamma(t) \leq M g_\delta(t)$ for all $t > 0$ is

$$M_\delta = C(\gamma)\cdot\frac{1}{\delta^2}\exp\left\{\frac{\gamma^2}{1 - \delta}\right\}.$$

Varying $\delta$, the multiplier $M_\delta$ is minimized at

$$\delta = 1 + \frac{\gamma^2}{4} - \frac{\sqrt{\gamma^4 + 8\gamma^2}}{4}.$$

If $\gamma \geq 0$, then setting $\delta = 1$ yields a heavy enough tail and $M = C(\gamma)$.
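The resulting rejection sampler can be sketched as follows (for $\nu = 3$, i.e., the density (15); the log-acceptance probability below is the log-ratio of the target to the scaled proposal, maximized as above):

```python
import numpy as np

def sample_fgamma(gamma, size, rng):
    """Rejection sampler for f_gamma(t) proportional to t * exp(-t - 2*gamma*sqrt(t)).
    Proposal: Gamma(2, rate=delta), with delta = 1 for gamma >= 0 and, for
    gamma < 0, the delta that minimizes the rejection bound M_delta."""
    if gamma >= 0:
        delta, log_sup = 1.0, 0.0
    else:
        g2 = gamma * gamma
        delta = 1.0 + g2 / 4.0 - np.sqrt(g2 * g2 + 8.0 * g2) / 4.0
        log_sup = g2 / (1.0 - delta)  # sup over t of the delta-dependent ratio
    out = []
    while len(out) < size:
        t = rng.gamma(2.0, 1.0 / delta, size=size)   # batch of proposals
        log_acc = -(1.0 - delta) * t - 2.0 * gamma * np.sqrt(t) - log_sup
        out.extend(t[np.log(rng.uniform(size=size)) < log_acc])
    return np.asarray(out[:size])
```

Draws from this sampler can be checked against the normalizing constant (16) and the moments of $f_\gamma$ computed by numerical integration.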

The rejection sampling performs draws from the exact conditional distribution (12). We find it works very well for data with not too extreme contamination such as, for instance, in the original as well as bootstrap data from the application discussed in Section 7.2. When applied to data with very extreme observations, however, one is faced with large positive values of $\gamma$. In this case the instrumental densities $g_\delta$ provide a poor approximation to the target density $f_\gamma$, and the acceptance probabilities in the rejection sampling step become impractically low.

For large positive $\gamma$, we thus use an alternative rejection procedure. Make the transformation $s = \sqrt{t}$. We then wish to sample from

$$h_\gamma(s) = 2C(\gamma)\cdot s^3\exp\bigl\{-s^2 - 2s\gamma\bigr\}.$$

Any Gamma distribution has a heavier tail than the target distribution $h_\gamma$. While it is not possible to find an analytical solution for the optimal shape and rate of the instrumental Gamma density, suitable fixed choices yield workable acceptance probabilities for most plausible values of $\gamma$. Since this alternative procedure will only be needed occasionally, these acceptance probabilities are adequate. Using this hybrid approach yields high overall acceptance probabilities even for the data with extreme contamination described in Section 7.1.

Returning to the iterations of the overall sampler, we calculate $S^*_{\tau YY}(\mu)$ at the end of each cycle through the nodes, and then take the average over the Gibbs iterations. This solves the problem of carrying out one E-step, and we obtain the following stochastic penalized EM algorithm, which we call the Monte Carlo $t^*$-lasso:

E-step:

Given current estimates $\mu^{(t)}$ and $\Theta^{(t)}$, compute the conditional expectations $\tau_{ij}^{(t+1)} = E[\tau_{ij} \mid Y_i]$ and $E[S^*_{\tau YY}(\mu^{(t)}) \mid Y]$ by averaging the matrices obtained in some large number of Gibbs sampler iterations, as described above.

M-step:

Calculate the updated estimates

$$\mu_j^{(t+1)} = \frac{\sum_{i=1}^n \tau_{ij}^{(t+1)}Y_{ij}}{\sum_{i=1}^n \tau_{ij}^{(t+1)}}.$$

Use these and the averaged Gibbs output to compute the matrix $S^*_{\tau YY}(\mu^{(t+1)})$ to be plugged into the trace term in (11). Maximize the resulting penalized log-likelihood function using the glasso.

### 5.3 Variational approximation

The above Monte Carlo procedure loses much of the computational efficiency of the classical tlasso from Section 4, however, and can be prohibitively expensive for large $p$. For large problems, we turn instead to variational approximations of the conditional density of the vector $\tau_i$ given the observed vector $Y_i$.

The variational approach proceeds by approximating the conditional density of $\tau_i$ given $Y_i$ by a factoring distribution. In our context, however, it is easier to approximate the joint density $p(\tau_i)\,p(Y_i \mid \tau_i)$ instead. The first term is already in product form because we are assuming the individual divisors to be independent in the model formulation, and the second term is the density of the multivariate normal distribution $\mathcal{N}_p\bigl(\mu,\ D(1/\sqrt{\tau_i})\,\Psi\,D(1/\sqrt{\tau_i})\bigr)$. We approximate this normal distribution by a member of the set of multivariate normal distributions with diagonal covariance matrix. Application of this naive mean field procedure, that is, choosing a distribution by minimizing Kullback–Leibler divergence, leads to the approximating distribution

$$\mathcal{N}_p\bigl(\mu,\ D(1/\sqrt{\tau_i})\,\bar{\Theta}^{-1}D(1/\sqrt{\tau_i})\bigr), \qquad (17)$$

where $\bar{\Theta}$ is the diagonal matrix with the same diagonal elements as $\Theta$ [Wainwright and Jordan (2008), Chapter 5]. Writing $\tilde{p}$ for the density of the distribution in (17), our approximation thus has the fully factoring form $p(\tau_i)\,\tilde{p}(Y_i \mid \tau_i)$. The resulting conditional distribution also factors as

$$q^*(\tau_i \mid Y_i) = \prod_{j=1}^p g(\tau_{ij} \mid Y_{ij}),$$

where $g(\tau_{ij} \mid Y_{ij})$ is the density of the Gamma-distribution $\Gamma(\alpha_{ij}, \beta_{ij})$, with its parameters corresponding to the quantities $\alpha$ and $\beta$ in (13).

In conclusion, the variational E-step consists of calculating, for each observation $i$ and each coordinate $j$, the expectations

$$E_g[\tau_{ij} \mid Y_{ij}] = \frac{\alpha_{ij}}{\beta_{ij}}, \qquad E_g[\sqrt{\tau_{ij}} \mid Y_{ij}] = \frac{\Gamma(\alpha_{ij} + 1/2)}{\Gamma(\alpha_{ij})\sqrt{\beta_{ij}}},$$

and, using independence under the approximation, the products $E_g[\sqrt{\tau_{ij}}]\,E_g[\sqrt{\tau_{ik}}]$ for $j \neq k$. These values are then substituted into (11). The M-step is the same as in the Monte Carlo $t^*$-lasso.

The effect of the variational approximation is that the weight for node $j$ in observation $i$ is based solely on the squared deviation of $Y_{ij}$ from the mean $\mu_j$ and on the corresponding diagonal entry $\theta_{jj}$ of the concentration matrix. For a given deviation from the mean, the larger the conditional variance of the node, the smaller the weight given to that node in that observation. But unlike in the Monte Carlo $t^*$-lasso, no consideration is given to the deviation from the conditional mean of the node in question given the rest. Some relevant information is therefore not being used, but in our simulations the effect was not noticeable.
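Because the variational weights are available in closed form, the E-step reduces to a few vectorized lines (a numpy sketch of our reading of the approximation, with $\alpha$ and $\beta$ as in (13) and the $\gamma$ term dropped):

```python
import numpy as np
from math import gamma

def variational_weights(Y, mu, Theta, nu):
    """Closed-form variational E-step for the alternative t-model:
    tau_ij | Y_ij ~ Gamma(alpha, beta_ij) with alpha = (nu + 1)/2 and
    beta_ij = (nu + (Y_ij - mu_j)^2 * theta_jj) / 2."""
    alpha = (nu + 1.0) / 2.0
    beta = (nu + (Y - mu) ** 2 * np.diag(Theta)) / 2.0
    E_tau = alpha / beta
    E_sqrt_tau = gamma(alpha + 0.5) / (gamma(alpha) * np.sqrt(beta))
    return E_tau, E_sqrt_tau

rng = np.random.default_rng(6)
Y = rng.standard_normal((50, 3))
Y[0, 1] = 8.0   # contaminate a single entry of one observation
E_tau, E_sq = variational_weights(Y, np.zeros(3), np.eye(3), nu=3.0)
```

Only the single contaminated entry is strongly downweighted; the other coordinates of that observation keep ordinary weights, in contrast to the classical tlasso, which would downweight the entire row.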

The resulting variational $t^*$-lasso is only slightly more expensive than the tlasso and, despite the relatively crude approximation in the variational E-step, performs well compared with the Monte Carlo $t^*$-lasso. Because of this, we will use exclusively the variational $t^*$-lasso when considering the alternative model in the simulations in the next section.

## 6 Simulation results

### 6.1 Procedure

We used simulated data to compare the three procedures glasso, tlasso and variational $t^*$-lasso as follows. We generated a random sparse inverse covariance (or dispersion) matrix $\Theta$ according to the following procedure:

1. Choose each lower-triangular element of $\Theta$ independently to be zero or one of two fixed nonzero values, with fixed probabilities.

2. For $j < k$ set $\theta_{jk} = \theta_{kj}$.

3. Define the diagonal element $\theta_{jj}$ in terms of $m_j$, the number of nonzero off-diagonal elements in the $j$th row of $\Theta$.

The final step ensures a strictly diagonally dominant, and thus positive definite, matrix. To strengthen the partial correlations, we reduced the diagonal elements by a common factor. We made this factor as large as possible while maintaining positive definiteness and stability for inversion. For these particular matrices, fixing a minimum eigenvalue worked well.
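A sketch of this matrix-generating procedure; the nonzero values, their probabilities, and the exact diagonal rule are elided in the text above, so the constants below are placeholders of ours, chosen so that the matrix is strictly diagonally dominant:

```python
import numpy as np

def random_sparse_theta(p, rng, val=1.0, prob=0.1):
    """Steps 1-3 sketched: sparse +/-val off-diagonals (probabilities are our
    placeholders), symmetrization, and a dominance-enforcing diagonal."""
    Theta = np.zeros((p, p))
    for j in range(p):
        for k in range(j):
            u = rng.uniform()
            if u < prob:
                Theta[j, k] = val
            elif u < 2 * prob:
                Theta[j, k] = -val
    Theta = Theta + Theta.T                    # symmetrize (step 2)
    m = (Theta != 0).sum(axis=1)               # nonzero count per row
    Theta[np.diag_indices(p)] = val * m + 0.5  # strict diagonal dominance (step 3)
    return Theta

rng = np.random.default_rng(7)
Theta = random_sparse_theta(20, rng)
```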

We then generated observations from the $\mathcal{N}(0, \Theta^{-1})$ distribution and ran each of the three procedures with a range of values for the one-norm tuning parameter $\rho$. To compare how well the competing methods recovered the true edges, we drew ROC curves. We ran this whole process 250 times and then repeated the entire computation, drawing data from classical $t$ and then alternative $t^*$ distributions with dispersion matrix $\Theta^{-1}$.

Simulating from $t$-distributions produces extreme observations, but a more realistic setting might be one in which normal data is contaminated in some fashion. For instance, consider broken probes or misread observations in a large gene expression microarray. Suppose the contaminated data are not so extreme as to be manually screened or otherwise identified as obvious outliers. To simulate this phenomenon, we generate normal data as above, but randomly contaminate a small fraction of the values with data generated from independent univariate $\mathcal{N}(0, \sigma^2)$ random variables, where $\sigma^2$ is a fixed multiple of the largest diagonal element of $\Theta^{-1}$. These contaminated values will be similar in magnitude to the tail of the original distribution and therefore difficult to identify.

Finally, we would like to compare our developed $t$-procedures with simpler approaches to robust inference. There are many ways to obtain robust estimates of the covariance matrix, but these usually require $n > p$. Instead we obtain a robust estimate of the marginal covariances and variances using the procedure of Kalisch and Bühlmann (2008). Since this is not guaranteed to result in a positive definite matrix, we add a constant to the diagonal elements of the matrix, chosen as the minimum value necessary to ensure the resulting matrix is nonnegative definite. We then use this robust estimate of the covariance matrix as input into the glasso and refer to this procedure as the robust glasso.

### 6.2 Results

Our tlasso and $t^*$-lasso are computationally more expensive than the glasso, since they call the glasso at each M-step. But in our simulations, the algorithms converge quickly. If we run through multiple increasing values of the tuning parameter $\rho$ for the one-norm penalty, it may take a number of EM iterations for the initial small value of $\rho$, but only 2 or 3 iterations for later values, as we can “warm start” at the previous output. But even in the initial run, two iterations typically lead to a drastic improvement (in the likelihood) over the glasso.

The only caveat is that the function being maximized by the tlasso methods is not guaranteed to be unimodal. We thus started in several places, and let the algorithm run for longer than probably necessary in practice. We did not observe drastically different results from different starting places. Nonetheless, since we are not guaranteed to find a global maximum, the statistical performances of the tlasso and -lasso may, in principle, be understated here (and, of course, the computational efficiency overstated).

In the worst case scenario for our procedures relative to the glasso (normal data, but $t$-distributions with few degrees of freedom assumed), almost no statistical efficiency is lost. In the numerous simulations we have run using normal data, the tlasso and glasso do an essentially equally good job of recovering the true graph (see Figure 4). The $t^*$-lasso performs surprisingly well at small to moderate false discovery rates. The robust glasso is based on a less efficient estimator and does not perform as well as the other procedures.

For data generated from a classical $t$-distribution with few degrees of freedom, the tlasso provides drastic improvement over the glasso at the low false positive rates that are of practical interest. The assumed normality and the occasional extreme observation lead to numerous false positives when using the glasso. Therefore, there is very little computational downside, and little or no statistical downside, to assuming $t$-distributions, but significant statistical upside. Interestingly, the $t^*$-lasso performs about as well as the tlasso. The robust glasso outperforms the purely Gaussian procedure at low false positive rates, since it is less susceptible to the most extreme observations.

In the third case, with data generated from the alternative $t^*$-distribution, only the $t^*$-lasso is able to recover useful information without substantial noise. The occasional large values are too extreme for the normal model to explain, and downweighting entire observations, as is done by the tlasso, discards too much information when there are extreme values scattered throughout the data. The robust glasso offers only a small improvement over the glasso.

With the contaminated data, the $t^*$-lasso does not perform as well as it does with $t^*$ data. The extreme values are not downweighted as much and, thus, the signals are noisier. It still performs far better, however, than either of the other methods, and is able to recover valuable information in a case where manual screening of outliers would be very difficult. The robust glasso does not perform as well as the $t^*$-lasso, but offers a clear improvement over the glasso and might be a useful alternative.

### 6.3 Notes on simulation

The simulations show that the tlasso performs very similarly to the glasso even with normal data. While one would expect a model based on the $t$-distribution to fare better with normal data than a normal model would with $t$ data, the fact that there is almost no statistical loss from the model misspecification is at first a bit surprising. The similarity of the results can be explained, however, by comparing the two procedures. In effect, the only difference is that the tlasso inputs a weighted sample covariance matrix into the glasso procedure; one can then think of the glasso as the tlasso with all weights set to one.

As noted in Section 3.2, these weights are the conditional expectations of , which are, from equation (5),

 τ(t+1)i=E[τi|Yi]=^ν+p^ν+δYi(μ(t),Ψ(t)), (18)

where is our estimate or assumption of the unknown degrees of freedom. If and , then is distributed according to the  distribution [Kotz and Nadarajah (2004), Chapter 3]. Thus, starting with the true values of and , the variance of the inverse weights is

  V\left[\frac{\hat{\nu} + \delta_Y(\mu, \Psi)}{\hat{\nu} + p}\right] = \frac{2p\nu^2(p + \nu - 2)}{(\nu - 2)^2(\nu - 4)(\hat{\nu} + p)^2}.

For normal data (i.e., ν = ∞), the variance is 2p/(ν̂ + p)^2 and goes to 0 very quickly as p gets large, no matter the assumed value ν̂. If our current estimates of μ and Ψ are reasonably close to the true values, then the observations will likely have very similar weights and the weighted covariance matrix will be very close to the sample covariance matrix. For t data, the above variance tends to 2ν^2/[(ν − 2)^2(ν − 4)] for large p; so no matter how many variables we have, the distribution of the inverse weights will have positive variance and the tlasso and glasso estimates are less likely to agree.
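As a concrete sketch of this weighting step, the following Python snippet (illustrative code of our own, not the authors' implementation; the function name and default ν are assumptions) computes the weights of equation (18) and the weighted covariance matrix that the glasso M-step would receive:

```python
import numpy as np

def tlasso_weights(Y, mu, Psi, nu=3.0):
    """E-step weights tau_i = (nu + p) / (nu + delta_{Y_i}(mu, Psi))
    and the resulting weighted sample covariance matrix (a sketch)."""
    n, p = Y.shape
    R = Y - mu                                     # centered observations
    Theta = np.linalg.inv(Psi)                     # inverse scale matrix
    delta = np.einsum('ij,jk,ik->i', R, Theta, R)  # Mahalanobis distances
    tau = (nu + p) / (nu + delta)                  # eq. (18)
    S = (R * tau[:, None]).T @ R / n               # weighted covariance
    return tau, S
```

Feeding S into any glasso routine gives one M-step of the tlasso; setting all weights τ_i = 1 recovers the plain glasso, which is exactly the comparison made above.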

## 7 Gene expression data

### 7.1 Galactose utilization

We consider data from microarray experiments with yeast strains [Gasch et al. (2000)]. As in Drton and Richardson (2008), we limit this illustration to 8 genes involved in galactose utilization. An assumption of normality is brought into question, in particular, by the fact that in 11 out of the 136 experiments with data for all 8 genes, the measurements for 4 of the genes were abnormally large negative values. In order to assess the impact of this handful of outliers, we run each algorithm, adjusting the penalty term such that a graph with a given number of edges is inferred. Somewhat arbitrarily, we focus on the top 9 edges. We do this once with all 136 experiments and then again excluding the 11 potential outliers.

As seen in Figure 5, the glasso infers very different graphs in the two runs, with only 3 edges in common. When the “outliers” are included, the glasso estimate in Figure 5(a) has the 4 nodes in question fully connected; when they are excluded, no edges among the 4 nodes are inferred. The tlasso does not exhibit this extreme behavior. As seen in Figure 5(b), it recovers almost the same graph in each case (7 out of 9 edges shared). When run with all the data, the estimated weight is very small for each of the questionable observations compared with the average weight over all observations. The graph in Figure 5(c) shows the results from the t*-lasso, which performs just as well as the tlasso: it also recovered 7 edges in both graphs (not shown) and infers relationships similar to those found by the tlasso.

Figure 6 illustrates the flexibility of the weighting schemes of the two t-based procedures. Both downweight the 11 potential outlier observations, but the alternative version does so only for the 4 nodes in question and not for the other nodes. Thus, the alternative version is able to extract information from the “uncontaminated” part of these observations while downweighting the rest. In this particular case, with 125 other observations, downweighting the outliers is of primary importance, and, thus, the increased flexibility of the t*-lasso over the tlasso does not make much of a difference in the inferred graphs. This might not be the case with a higher contamination level.

### 7.2 Isoprenoid pathway

We next consider gene expression data for the isoprenoid pathways of Arabidopsis thaliana discussed in Wille et al. (2004). Gene expressions were measured in 118 Affymetrix microarrays for 39 genes. While the data set described in the previous section had clear deviations from normality, the data described in this section have no obvious deviations that stand out in exploratory plots.

Two approaches were considered in Wille et al. (2004). The first (GGM1) fit a Gaussian graphical model using BIC and backward selection to obtain a network with 178 edges. This number was deemed too large for interpretation, and the authors considered instead only the 31 edges found in a sufficiently large fraction of bootstrapped samples. The second approach (GGM2) tests the conditional independence of each pair of genes given a third gene. An edge is drawn only if a test of conditional independence is rejected for every other gene in the network. This approach is advocated in the paper and appears to find a network with better biological interpretation. The graph is shown in Figure 7, where shaded nodes indicate the so-called MEP pathway.

Our approach is modeled after GGM1. We used the t*-lasso with increasing values of the penalty parameter ρ to find a path of models to test. For each chosen model, we ran the t*-lasso again, but this time without penalty on the allowed edges. Since the likelihood is unavailable, we use leave-one-out cross-validation to find the model with the lowest mean squared prediction error. Since the exact conditionals of the alternative t-distribution are not available in explicit form, we perform the cross-validation as follows:

1. Estimate the model parameters using all but one observation.

2. In the held-out observation, estimate the values of the latent normal variables for all but one of the coordinates in the same manner as the variational E-step of Section 5.3.

3. Predict the remaining normal value.

4. Scale the predicted normal value by the expectation of the associated divisor term.

We remark that we also experimented with leaving out a larger fraction of the observations as suggested in the work of Shao (1993), but this led to similar conclusions in the present example.
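To make the structure of this cross-validation concrete, here is a simplified Gaussian sketch in Python (our own illustrative code: each held-out coordinate is predicted by its conditional mean given the others, omitting the latent-variable imputation and divisor scaling of the t* model; `fit_precision` is a hypothetical placeholder for any precision-matrix estimator):

```python
import numpy as np

def loo_prediction_error(Y, fit_precision):
    """Leave-one-out mean squared prediction error for a Gaussian model.
    Each coordinate of the held-out row is predicted by its conditional
    mean mu_j - Theta_jj^{-1} Theta_{j,-j} (y_{-j} - mu_{-j})."""
    n, p = Y.shape
    err = 0.0
    for i in range(n):
        train = np.delete(Y, i, axis=0)       # hold out observation i
        mu = train.mean(axis=0)
        Theta = fit_precision(train)          # estimated precision matrix
        for j in range(p):
            rest = np.delete(np.arange(p), j)
            pred = mu[j] - Theta[j, rest] @ (Y[i, rest] - mu[rest]) / Theta[j, j]
            err += (Y[i, j] - pred) ** 2
    return err / (n * p)
```

With strongly correlated coordinates, the conditional predictions beat the marginal mean, which is what the cross-validation score rewards when comparing models along the penalty path.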

The cross-validation procedure gave a network with 122 edges. To reduce to the graph size found by GGM2, we took 500 bootstrapped samples of the data, fixing the penalty parameter found in cross-validation, and only included those edges appearing in a sufficiently large fraction of the samples. For comparison, we also ran the above procedure using the glasso, adjusting the edge inclusion threshold to obtain a graph of the same size.

We believe our procedure infers a graph that compares favorably (in terms of biological interpretation) with that found by GGM2. Like GGM2, we find a connection between AACT2 and the group MK, MPDC1 and FPPS2; GGM1 found AACT2 to be disconnected from the rest of the graph despite its high correlation with these three genes. In the MEP pathway, our approach and GGM2 find similar structure; compare Figures 7 and 8.

While our approach finds the key relationships identified in Wille et al., it achieves this with fewer “cross-talk” edges between the two pathways. The authors discuss plausible interpretations for such interactions between the pathways, but a graph with less cross-talk might be closer to the scientists’ original expectation (Figures 7 and 8). It is worth noting that the glasso procedure performs better than GGM1, with edge inclusion being far less sensitive to the particular bootstrapped sample. The glasso also finds the key relationships of GGM2. We also ran the tlasso, which gave results similar to the glasso, and the Monte Carlo version of the t*-lasso, which behaved similarly to the variational t*-lasso. We do not show these results here.

## 8 Discussion

Our proposed tlasso and t*-lasso algorithms are simple and effective methods for robust inference in graphical models. Only slightly more computationally expensive than the glasso, they can offer great gains in statistical efficiency. The alternative t-distribution is more flexible than the classical one and is generally preferred. We find that the simple variational E-step is an efficient way to estimate the graph in the alternative case, but we also explored more sophisticated Monte Carlo approximations.

We assumed a fixed, small value for the degrees of freedom in our various tlasso and t*-lasso runs. As suggested in prior work on t-distribution models, estimation of the degrees of freedom can be done efficiently in the classical model by a line search based on the observed log-likelihood function.

In the alternative model, the choice of ν puts an explicit upper bound on the maximum correlation between two variables, with the bound increasing quickly in ν (see Figure 9). This makes inference of the degrees of freedom potentially more relevant than in the classical model, as an alternative model with small ν might not be a good fit for highly correlated variables. In order to select ν, a line search based on the hidden log-likelihood function can be employed. For further flexibility, we may also allow the degrees of freedom to vary with each node. That is, we could let the divisors τ_j be independent Gamma-divisors with possibly different degrees of freedom ν_j. This leads to similar conditionals in the Gibbs sampler, and the resulting procedure is thus no more complicated. Nevertheless, for the purposes of graph estimation, our experience and related literature suggest that not much is lost by considering only a few small values for the degrees of freedom. For instance, running the t*-lasso procedure in Section 7.2 with a different small value of ν produces a very similar result with one additional cross-talk edge.
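In the classical model, where the observed log-likelihood is available in closed form (it can be evaluated, for instance, with scipy.stats.multivariate_t), the line search over ν can be as simple as a grid maximization. A minimal sketch, with a grid of our own choosing:

```python
import numpy as np
from scipy.stats import multivariate_t

def select_df(Y, mu, Psi, grid=(1, 2, 3, 5, 10, 20)):
    """Pick the degrees of freedom maximizing the observed log-likelihood
    of the classical multivariate t model (grid values are illustrative)."""
    lls = [multivariate_t(loc=mu, shape=Psi, df=df).logpdf(Y).sum()
           for df in grid]
    return grid[int(np.argmax(lls))]
```

In a full fit one would alternate such a search with EM updates of μ and Ψ, in the spirit of Liu and Rubin (1995).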

In the last section we used cross-validation to choose the one-norm tuning parameter ρ. The likelihood is not available explicitly for the alternative t-distribution, and so we cannot easily use information criteria for the t*-lasso. Cross-validation, however, often tends to pick more edges than is desirable when the goal is inference of the graph rather than optimal prediction. An interesting but potentially difficult problem for future research would be to develop rules for choosing ρ that control an edge inclusion error rate; compare Banerjee, El Ghaoui and d’Aspremont (2008) and Meinshausen and Bühlmann (2006).

Throughout the paper, we have penalized all the elements of the inverse covariance matrix. One alternative is to remove the penalty from its diagonal elements, since we expect all of these to be nonzero. This leads to smaller estimated partial correlations, and we found it to result in less stable behavior of the tlasso, in the sense of the number of edges decreasing rather suddenly as ρ increases.

Finally, we remark that other normal scale-mixture models could be treated in a fashion similar to the t-distribution models we considered in this paper. However, the use of t-distributions is particularly convenient in that it is rather robust to various types of misspecification, involves only the choice of the degrees of freedom parameters for the distribution of the Gamma-divisors, and maintains good efficiency when the data are Gaussian.

## Appendix

Proof of Proposition 1. According to standard graphical model theory [Lauritzen (1996)], it suffices to show that Y_j and Y_k are conditionally uncorrelated given the remaining variables Y_b. Partition the index set into a = {j, k} and its complement b. For a given value of τ,

  (Y_a \mid Y_b, \tau) \sim N_2\bigl(\mu_a - \Theta_{a,a}^{-1}\Theta_{a,b}(Y_b - \mu_b),\ \Theta_{a,a}^{-1}/\tau\bigr)

and

  (Y_j \mid Y_{k \cup b}, \tau) \sim N\bigl(\mu_j - \theta_{jj}^{-1}\Theta_{j, k \cup b}(Y_{k \cup b} - \mu_{k \cup b}),\ \theta_{jj}^{-1}/\tau\bigr).

Since θ_{jk} = 0,

  E[Y_j \mid Y_{k \cup b}, \tau] = \mu_j - \theta_{jj}^{-1}\Theta_{j,b}(Y_b - \mu_b) = E[Y_j \mid Y_b, \tau]

for any value of τ. Therefore,

  E[Y_j \mid Y_{k \cup b}] = E\bigl[E[Y_j \mid Y_{k \cup b}, \tau] \mid Y_{k \cup b}\bigr] = E\bigl[E[Y_j \mid Y_b, \tau] \mid Y_b\bigr] = E[Y_j \mid Y_b],

which implies that Y_j and Y_k are conditionally uncorrelated given Y_b.

## References

• Banerjee, O., El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516.
• Drton, M. and Perlman, M. D. (2007). Multiple testing and error control in Gaussian graphical model selection. Statist. Sci. 22 430–449.
• Drton, M. and Richardson, T. S. (2008). Graphical methods for efficient likelihood inference in Gaussian covariance models. J. Mach. Learn. Res. 9 893–914.
• Friedman, N. (1997). Learning belief networks in the presence of missing values and hidden variables. In Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN.
• Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
• Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. and Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell 11 4241–4257.
• Kalisch, M. and Bühlmann, P. (2008). Robustification of the PC-algorithm for directed acyclic graphs. J. Comput. Graph. Statist. 17 773–789.
• Kotz, S. and Nadarajah, S. (2004). Multivariate t-Distributions and Their Applications. Cambridge Univ. Press, Cambridge.
• Lauritzen, S. L. (1996). Graphical Models. Oxford Univ. Press, New York.
• Liu, C. and Rubin, D. B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statist. Sinica 5 19–39.
• McLachlan, G. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York.
• Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
• Miyamura, M. and Kano, Y. (2006). Robust Gaussian graphical modeling. J. Multivariate Anal. 97 1525–1550.
• Ravikumar, P., Raskutti, G., Wainwright, M. J. and Yu, B. (2009). Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In Advances in Neural Information Processing Systems (NIPS) (D. Koller, D. Schuurmans, Y. Bengio and L. Bottou, eds.) 21 1329–1336.
• Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494.
• Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1 1–305.
• Wille, A., Zimmermann, P., Vranová, E., Fürholz, A., Laule, O., Bleuler, S., Hennig, L., Prelić, A., von Rohr, P., Thiele, L., Zitzler, E., Gruissem, W. and Bühlmann, P. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5 R92.
• Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35.