EM algorithm, Gaussian graphical model, mouse genomics, shrinkage, sparsity, variable selection
Gaussian graphical models are widely used to represent conditional dependencies among sets of normally distributed outcome variables that are observed together. For example, observed, and potentially dense, correlations between measurements of expression for multiple genes, stock market prices of different asset classes, or blood flow for multiple voxels in functional magnetic resonance imaging, i.e., fMRI-measured brain activity, can often be more parsimoniously explained by an underlying graph whose structure may be relatively sparse. As methods for estimating these underlying graphs have matured, a number of elaborations to basic Gaussian graphical models have been proposed, including those that seek either to model the sampling distribution of the data more closely, or to model prior expectations of the analyst about structural similarities among graphs representing related data sets(Guo et al., 2011; Danaher et al., 2014; Lee & Liu, 2015). In this paper, we propose an elaboration that seeks to model an additional feature of the sampling distribution increasingly encountered in biomedical data, whereby correlations among the outcome variables are considered to be the byproduct of underlying conditional dependencies acting at different levels. For illustration, consider gene expression data obtained from multiple tissues, such as liver, kidney, and brain, collected on each individual. In this setting, observed correlations between expressed genes may be caused by dependence structures not only within a specific tissue but also across tissues at the level of the whole body. We describe these distinct graphical strata respectively as the category-specific and systemic layers, and model their respective outcomes as latent variables.
The conditional dependence relationships among outcome variables, , can be represented by a graph , where each variable is a node in the set and conditional dependencies are represented by the edges in the set
. If the joint distribution of the outcome variables is multivariate Gaussian,, then conditional dependencies are reflected in the non-zero entries of the precision matrix . Specifically, two variables and are conditionally independent given the other variables if and only if the th entry of is zero. Inferring the dependence structure of such a Gaussian graphical model is thus the same as estimating which elements of its precision matrix are non-zero.
When the underlying graph is sparse, as is often assumed, the maximum likelihood estimator is dominated in terms of false positive rate by shrinkage estimators. The maximum likelihood estimate of typically implies a graph that is fully connected, which is unhelpful for estimating graph topology. To impose sparsity, and thereby provide a more informative inference about network structure, a number of methods have been introduced that estimate under regularization. Meinshausen & Bühlmann (2006) proposed to iteratively determine the edges of each node in by fitting an penalized regression model to the corresponding variable using the remaining variables as predictors, an approach which can be viewed as optimizing a pseudo-likelihood (Ambroise et al., 2009; Peng et al., 2009). More recently, numerous papers have proposed estimation using sparse penalized maximum likelihood (Yuan & Lin, 2007; Banerjee et al., 2008; d’Aspremont et al., 2008; Rothman et al., 2008; Ravikumar et al., 2011). Efficient implementations include the graphical lasso algorithm (Friedman et al., 2008) and the quadratic inverse covariance algorithm (Hsieh et al., 2014). The convergence rate and selection consistency of such penalized estimation schemes have also been investigated in theoretical studies (Rothman et al., 2008; Lam & Fan, 2009).
Although a single graph provides a useful representation of an underlying dependence structure, several extensions have been proposed. In the context where the precision matrix, and hence the graph, is dynamic over time, Zhou et al. (2010) proposed a weighted method to estimate the graph’s temporal evolution. Another extension is to simultaneously estimate multiple graphs that may share some common structure. For example, when inferring how brain regions interact using fMRI data, each subject’s brain corresponds to a different graph, but we would nonetheless expect some common interaction patterns across subjects, as well as patterns specific to an individual. In such cases, joint estimation of multiple related graphs can be more efficient than estimating graphs separately. For joint estimation of Gaussian graphs, Varoquaux et al. (2010) and Honorio & Samaras (2010) proposed methods using group lasso and multitask lasso, respectively. Both assume that all the precision matrices have the same pattern of zeros. To provide greater flexibility, Guo et al. (2011) proposed a joint penalized method using a hierarchical penalty, and derived the convergence rate and sparsistency properties for the resulting estimators. In the same setting, Danaher et al. (2014) extended the graphical lasso (Friedman et al., 2008) to estimate multiple graphs from independent data sets using penalties based on the generalized fused lasso or, alternatively, the sparse group lasso.
The above methods for estimating multiple Gaussian graphs focus on the settings in which data collected from different categories are stochastically independent. In some applications, however, data from different categories are more naturally considered as dependent. In a study considered here, gene expression data have been collected on multiple tissues in multiple mice. For each mouse we have expression measurements for genes in each of different tissues, that is, different categories, represented by the
-dimensional vectors. In this setting, the gene expression profiles of different mice may have arisen from the same network structure, but they are otherwise stochastically independent; in contrast, the gene expression profiles of different tissues within the same mouse are stochastically dependent. For such data, increasingly common in biomedical research, the above methods are not applicable.
To explore the gene network structure across different tissues, and to characterize the dependence among tissues, we consider a decomposition of the observed gene expression into two latent vectors. In our model, we define
where are mutually independent. Because for any , represents dependence across different tissues. Letting denote the precision matrix of for tissue , and defining , we aim to estimate from the observed outcome data on . To accomplish this joint estimation of multiple dependent networks, we propose a one-step method and an EM method.
In the above decomposition, can be viewed as representing systemic variation in gene expression, that is, variation manifesting simultaneously in all measured tissues of the same mouse, whereas represents category-specific variation, that is, variation unique to tissue . An important property of this two-layer model is that sparsity in the systemic and category-specific networks can produce networks for the outcome variable that is highly connected. Conversely, highly connected graphs for the outcome can easily arise from relatively sparse underlying dependencies acting at two levels. This phenomenon is illustrated in Fig. 1, which depicts category-specific networks and for two categories and , which might correspond to, for example, liver and brain tissue-types, and a systemic network , which reflects relationships affecting all tissues at once, for example, gene interactions that are responsive to hormone levels or other globally-acting processes. Although all three underlying networks, , and , are sparse, the precision matrix of observed variables within each tissue, that is, the aggregate network following (1) is highly connected. Existing methods aiming to estimate a single sparse network layer are therefore ill-suited to this problem because they impose sparsity on the aggregate network rather than on the two simpler layers that generate it.
2.1 Problem formulation
The following notation is used throughout the paper. We denote the true precision and covariance matrices by and . For any matrix , we denote the determinant by , the trace by and the off-diagonal entries of by . We further denote the
th eigenvalue ofby , and the minimum and maximum eigenvalues of by and . The Frobenius norm, , is defined as ; the operator/spectral norm, , is defined as ; the infinity norm, , is defined as ; and the element-wise norm, , is defined as .
In the problem, we address, measurements are available on the same outcome variables in each of distinct categories on each of individuals. Some dependence is anticipated among outcomes both at the level of the category and at the level of the individual: dependence at the level of the category is described as category-specific; dependence at the level of the individual is described as systemic, that is, modeled as if affecting outcomes in all categories of the same individual simultaneously. Our primary example is the measurement of gene expression giving rise to transcript abundance readings on genes on tissues, such as liver, kidney and brain, in laboratory mice.
Letting be the th data vector for the th category, we model
where is the random vector corresponding to the shared systemic random effect, and is the random vector corresponding to the th category. We assume that and are independent and identically distributed -dimensional random vectors with mean , and covariance matrices and respectively. We further assume that , and
are independent of each other and each follows a multivariate Gaussian distribution.
For the th sample in the th category, we observe the -dimensional realization of , vector . Without loss of generality, we assume these observations are centered, i.e., ). Let be the combined data vector with , such that follows a Gaussian distribution with covariance , where is a block diagonal matrix, is a square matrix with all as the entries, is the Kronecker product, and is the covariance matrix between and . We denote the by dimensional data matrix by , and let , and . Our goal is to estimate . Although and are latent variables, we can show that is identifiable under the model setup in (2) with . More details can be found in the Supplementary Material. For simplicity, we denote and as and respectively in the following derivation.
The log-likelihood of the data can be written as
is the sample covariance matrix. In our setting,
where ; see the Supplementary Material for details.
A natural way to obtain a sparse estimate of is to maximize the penalized log-likelihood
Because the likelihood is complicated, direct estimation of the precision matrices in (4) is difficult. Estimation can proceed directly, however, given the values of the latent outcome vector . Therefore, we first estimate and then the other parameters. In Sections 2.2 and 2.3, we consider estimation of these multiple dependent graphs using a one-step procedure and a method based on the EM algorithm.
2.2 One-step method
The idea behind our one-step method is to generate a good initial estimate for and then obtain estimates for by one-step optimization. Because , for any , it is natural to use the covariance matrix between all pairs of and to estimate by
Using the fact that , we can then obtain an estimate for as
Although is symmetric, it may not be positive semidefinite, but this can be ensured using projection (Xu & Shao, 2012). For any symmetric matrix , the positive-semidefinite projection is
Lastly, we estimate by minimizing separate functions,
where when , and otherwise. The minimization of (8) can be solved efficiently by algorithms such as the graphical lasso (Friedman et al., 2008) or by the quadratic inverse covariance algorithm (Hsieh et al., 2014). We name this approach as the one-step method and later compare its performance with the EM method defined next.
2.3 Graphical EM method
The one-step method provides an estimate of . In the spirit of the classic EM algorithm (Dempster et al., 1977), this estimate of can be used to obtain a better estimate of , which in turn can be used to obtain a better estimate of . This procedure is iterated until the estimates of converge, leading to a graphical EM algorithm, described below.
First, we rewrite the sampling model as
and the log-likelihood given and as
Expression (2.3) cannot be calculated directly because and are unobserved. However, we can replace them with their expected values conditional on and , and develop the EM algorithm with the following steps:
Update the expectation of the log-likelihood conditional on using
Update that maximizes
where denotes the estimates from the th iteration, denotes the conditional expectation with respect to given , and
where is an estimator for , the true covariance matrix of . Therefore, problem (10) is decomposed into separate optimization problems:
We summarize the proposed EM method in the following steps:
The next proposition demonstrates convergence of our graphical EM algorithm.
2.4 Model selection
We consider two options for selecting the tuning parameter , minimization of the extended Bayesian information criterion (Chen & Chen, 2008), and cross-validation. The extended Bayesian information criterion is quick to compute and takes into account both goodness of fit and model complexity. Cross-validation, by contrast, is more computationally demanding and focuses on the predictive power of the model.
In our model, we define the extended Bayesian information criterion
where are the estimates with the tuning parameter set at ,
is the log-likelihood function, the degrees of freedomis the sum of the number of non-zero off-diagonal elements on , and is the number of models with size , which equals choose . This criterion is indexed by a parameter . The tuning parameter is selected as .
In describing the cross-validation procedure, we define the predictive negative log-likelihood function as To select using cross-validation, we randomly split the dataset equally into groups, and denote the sample covariance matrix from the th group as and the precision matrix estimated from the remaining groups as . Then we choose
The performance of these two selection methods is reported in Section 4.
3 Asymptotic properties
We introduce some notations and the regularity conditions. Let be the true precision matrices with , be the index set corresponding to the nonzero off-diagonal entries in , be the cardinality of , and . Let be the true covariance matrices for and , and be the true covariance matrices for . We assume that the following regularity conditions hold.
Condition 1. There exist constants and such that .
Condition 2. There exist constants and such that
Condition 1 bounds the eigenvalues of , thereby guaranteeing the existence of its inverse. Condition 2 is needed to facilitate the proof of consistency. The following theorems discuss estimation consistency and selection sparsistency of our methods.
Theorem 3.1 (Consistency of the one-step method).
Under Conditions 1 and 2, if , then the solution of the one-step method satisfies
We next present a corollary of Theorem 3.1 which gives a good estimator of .
Under the assumptions of Theorem 3.1 and being the one-step solution, satisfies
To study our EM estimator, we need an estimator for that satisfies the following condition.
Condition 3. We assume there exists an estimator such that
The rate in Condition 3 is required to control the convergence rate of the E-step estimating and thus the consistency of the estimate from the EM method. Under the conditions in Theorem 3.1, we can use the one-step estimator to obtain , where is a block diagonal matrix. The resulting satisfies Condition 3 by Corollary 3.2.
Theorem 3.3 (Consistency of the EM method).
If Conditions 1-3 hold and , then after a finite number of iterations, the solution of the EM method satisfies
Theorem 3.4 (Sparsistency of the one-step method).
For sparsistency we require a lower bound on the rates of and , but for consistency, we need an upper bound for and to control the biases. In order to have consistency and sparsistency simultaneously, we need the bounds to be compatible, that is, we need . From the inequalities , there are two extreme scenarios describing the rate of , as discussed in Lam & Fan (2009). In the worst case, where has the same rate as , we achieve both consistency and sparsistency only when . In the most optimistic case, where , we have , and the compatibility of the bounds requires .
Theorem 3.5 (Sparsistency of the EM method).
Under the assumptions of Theorem 3.3, if we further assume the EM solution satisfies for a sequence , and if , then with probability tending to 1, for all
Similar to the discussion above for the EM algorithm, we obtain both consistency and sparsistency when . See the Supplementary Material.
4.1 Simulating category-specific and systemic networks
We assessed the performance of the one-step and EM methods by applying them to simulated data generated by two types of synthetic networks: a chain network and a nearest-neighbor network as shown in Fig. 2. Twelve simulation settings were considered. These varied the base architecture of the category-specific network, the degree to which the actual structure could deviate from this base architecture, and the number of outcome variables.
Under each of the 12 simulation conditions, samples were independently and identically distributed, with systemic outcomes generated as , category-specific outcomes as , and observed outcomes as , for , and . The following architectures were considered for the five networks :
(I) the category-specific networks are chain-networks and the systemic network is a nearest-neighbor network with the number of neighbors and for and ;
(II) the category-specific networks and the systemic network are all nearest-neighbor networks with and for and respectively.
Chain and nearest-neighbor networks were generated using the algorithms in Fan et al. (2009) and Li & Guo (2006). The structures of network (I) are shown in Fig. 2(a) and (d). Simulated networks were allowed to deviate from their base architectures by a specified degree , through a random addition of edges following the method of Guo et al. (2011). Specifically, for each generated above, a symmetric pair of zero elements is randomly selected and replaced with a value generated uniformly from . We repeat this procedure times, with being the number of links in the initial structure, and .
We compared the performance of the one-step and EM methods by examining the average false positive rate, average false negative rate, average Hamming distance, average entropy loss
and average Frobenius loss
We also examined receiver operating curves for the two methods.
4.2 Estimation of category-specific and systemic networks
As shown in Fig. 1, existing methods are designed to estimate the aggregate networks instead of category-specific and systemic networks. In this subsection, we focus only on our proposed one-step and EM methods.
Results of the simulations are reported in Table 1. Summary statistics are based on replicate trials under each of the 12 conditions, and given for model fitting under both extended Bayesian information criterion with and under cross-validation criteria. In general, the one-step method under either model selection criteria resulted in higher values of entropy loss, Frobenius loss, false positive rates and Hamming distance. For both methods, cross-validation tended to choose models with more false positive links but fewer false negative links, leading to a denser graph. For model selection, a rule of thumb is to use cross-validation when , and to use the extended Bayesian information criterion otherwise.
Receiver operating curves for the one-step and EM methods are plotted in Fig. 3; each is based on 100 replications with the constraint . Under all settings, the EM method outperforms the one-step method, yielding greater improvements as the structures become more complicated.
4.3 Estimation of aggregate networks
Although our goal is to estimate the two network layers, we can also use our estimators of to estimate the aggregate network as a derived statistic. Doing so allows us to compare our method with methods that aim to estimate the aggregate network , these methods otherwise being incomparable.
We compared the performance of the EM method with two existing single-level methods for estimating multiple graphs: the hierarchical penalized likelihood method of Guo et al. (2011), and the joint graphical lasso of Danaher et al. (2014). As shown by simulation results reported in the Supplementary Material, these two single-level methods tended to give similar, sparse estimates that were very different from the true aggregate graph. The true aggregate graph tended to be highly connected, as illustrated in Fig 1, and under most settings was much better estimated by the EM. An exception was setting (II) with and , where is relatively sparse, and where the best performance came from the method of Guo et al. (2011). Sparsity in arises under this setting because when and are chain networks has a strong banding structure, with large absolute values within the band and small absolute values outside.
5 Application to gene expression data in mouse
To demonstrate the potential utility of our approach, we apply the EM method to mouse genomics data from Dobrin et al. (2009) and Crowley et al. (2015). In each case, we aim to infer systemic and category-specific gene co-expression networks from transcript abundance as measured by microarrays. In describing our inference on these datasets we find it helpful to distinguish two interpretations of a network: the potential network is the network of biologically possible interactions in the type of system under study; the induced network is the subgraph of the potential network that could be inferred in the population sampled by the study. The induced network is therefore a statistical, not physical, phenomenon, and describes the dependence structure induced by the interventions, or perturbations, applied to the system.
A simple example is the relationship between caloric intake, sex, and body weight. Body weight is influenced by both the state of being male or female and the degree of calorie consumption; these relations constitute edges in the potential network. Yet in a population where caloric intake varies but where individuals are exclusively male, the effect of sex is undefined and the corresponding edges relating sex to body weight are undetectable; these edges are therefore absent in the induced network. More generally, the induced network for a system is defined both by the potential network and the intervention applied to it: two populations of mice could have the same potential network, but when subject to different interventions their induced networks could differ. Conversely, when estimating the dependence structure of variables arising from population data, the degree to which the induced network reflects the potential network is a function of the underlying conditions being varied and interventions at work.
The Dobrin et al. (2009) dataset comprises expression measurements for 23,698 transcripts on 301 male mice in adipose, liver, brain and muscle tissues. These mice arose from an cross between two contrasting inbred founder strains, one with normal body weight physiology and the other with a heritable tendency for rapid weight-gain. In a cross of this type, the analyzed offspring constitutes an independent and identically distributed sample of individuals who are genetically distinct and have effectively been subject to a randomized allocation of normal and weight-inducing DNA variants, or alleles, at multiple locations along its genome. As a result of this allocation, gene expression networks inferred on such a population would be expected to emphasize more strongly those subgraphs of the underlying potential network that are related to body weight. Moreover, since the intervention alters a property affecting the entire individual, we might expect it to exert at least some of its effect systemically, that is, globally across all tissues in each individual.
Using a subset of the data, we inferred the dependence structure of gene co-expression among three groups of well-annotated genes in brain and liver: an obesity-related gene set, an imprinting-related gene set, and an extracellular matrix, i.e., the ECM-related gene set. These groups were chosen based on criteria independent of our analysis and represent three groups whose respective effects would be exaggerated under very different interventions. The tissue-specific and systemic networks inferred by our EM method are shown in Fig. 4. Each node represents a gene, and the darkness of an edge represents the magnitude of the associated partial correlation. The systemic network in Fig. 4(c) includes edges on the Aif1 obesity-related pathway only, which is consistent with the exhibiting a dependence structure induced primarily by an obesity-related genetic intervention that acts systemically. The category-specific networks in Fig. 4(a) and (b) still include part of the Aif1 pathway, suggesting that variation in this pathway tracks variation at both the systemic and tissue-specific level; in other ways their dependence structures differ, with, for instance, Aif1 and Rgl2 being linked in the brain but not in the liver. The original analysis of Dobrin et al. (2009) used a correlation network approach, whereby unconditional correlations with statistical significance above a predefined threshold were declared as edges; that analysis also supported a role for Aif1 in tissue-to-tissue co-expression.
The Crowley et al. (2015) data comprise expression measurements of transcripts in brain, liver, lung and kidney tissues in mice arising from three independent reciprocal crosses. A reciprocal F1 cross between two inbred strains A and B generates two sub-populations: the progeny of strain A females mated to strain B males denoted by AxB, and the progeny of strain B females to strain A males denoted by BxA. Across the two progeny groups, the set of alleles inherited is identical, with each mouse having inheriting half of its alleles from A and the other half from B; but the route through which those alleles were inherited differs, with, for example, AxB offspring inheriting their A alleles only from their fathers and BxA inheriting them only from their mothers. The underlying intervention in a reciprocal cross is therefore not the varying of genetics as such but the varying of parent-of-origin, or epigenetics, and so we might expect some of this epigenetic effect to manifest across all tissues.
We applied our EM method to a normalized subset of the Crowley et al. (2015) data, restricting attention to brain and liver, and removing gross effects of genetic background. Our analysis identified three edges on the systemic network as shown in Fig. 5(c) that include the genes Igf2, Tab1, Nrk and Pde4b, all from the imprinting-related set implicated in mediating epigenetic effects. Thus, the inferred patterns of systemic-level gene relationships in the two studies coincide with the underlying interventions implied by the structure of those studies, with genes affecting body weight in the Dobrin et al. (2009) data and genes affected by parent-of-origin in the Crowley et al. (2015) data.
To demonstrate the use of our method for higher dimensional data, we examined a larger subset of genes from Dobrin et al. (2009). Selecting the genes that had the largest within-group variance among the four tissues in the population, we applied our graphical EM method, using the extended Bayesian information criterion to select the tuning parameters and . The existence of a single, non-zero systemic layer for these data was strongly supported by significance testing, as described in the Supplementary Material. The topologies of the estimated tissue-specific and systemic networks are shown in Fig. 6(a-d), with a zoomed-in view of the edges of the systemic network shown in Fig. 6(f). The systemic network is sparse, with 249 edges connecting 62 of the 1000 genes in Fig. 6(e); this sparsity may reflect there being few interactions simultaneously occurring across all tissues in this population, with one contributing reason being that some genes are being expressed primarily in one tissue and not others. The systemic network also includes a connection between two genes, Ifi44 and H2-Eb1, that are members of the Aif1 network of Fig. 4. To characterize more broadly the genes identified in the systemic network, we conducted an analysis of gene ontology enrichment (Shamir et al., 2005) in which the distribution of gene ontology terms associated with connected genes in the systemic network was contrasted against the background distribution of gene ontology terms in the entire 1000-gene set; this showed that the systemic network is significantly enriched for genes associated with immune and metabolic processes, which accords with recent studies linking obesity to strong negative impacts on immune response to infection (Milner & Beck, 2012; Lumeng, 2013). The original study of Dobrin et al. (2009) also showed that the enrichment of inflammatory response processes in co-expression from liver and adipose, again using unconditional correlations.
In this paper we consider joint estimation of a two-layer Gaussian graphical model that is different from but related to the single-layer model. In our setting, the single-layer model estimates an aggregate graph by imposing sparsity on directly. Our model, by contrast, estimates the two graphical layers that compose the aggregate, namely and , and imposes sparsity on each. This can imply an aggregate graph that is less sparse; but this is appropriate because in our setting is a byproduct and, as such, is a secondary subject of inference. Importantly, our two-layer model includes the single-layer model as a special case, since in the absence of an appreciable systemic dependence, when , the two-layer model reduces to a single layer.
Our model lends itself to several immediate extensions. First, we currently assume that the systemic graph affects all tissues equally, but, as suggested by one reviewer, we can extend our model to allow the influence of the systemic layer to vary among tissues. For example, since muscle and adipose are both developed from the mesoderm, we might expect them to be more closely related to each other as compared with the pancreas, which is developed from the endoderm. We can accommodate such variation in our model as:
where quantifies the level of systemic influence in each tissue . Our EM algorithm can also be modified to calculate and . More details can be found in the Supplementary Material.
Second, we can extend the penalized maximum likelihood framework to other nonconvex penalties such as the truncated -function (Shen et al., 2012) and the smoothly clipped absolute deviation penalty (Fan & Li, 2001). Furthermore, we believe it would be both practicable and useful to extend these methods beyond the Gaussian assumption (Cai & Liu, 2011; Liu et al., 2012; Xue & Zou, 2012).
The authors thank the editor, the associate editor and two reviewers for their helpful suggestions. This work was supported in part by the U.S. National Institutes of Health and the National Science Foundation. Yuying Xie is also affiliated with Department of Statistics and Probability at Michigan State University. Yufeng Liu is affiliated with Department of Genetics and Carolina Center for Genome Sciences, and both he and William Valdar are also affiliated with Department of Biostatistics and the Lineberger Comprehensive Cancer Center at the University of North Carolina.
Appendix A Derivation of the likelihood
For simplicity, we write for in the following derivation. To derive the log-likelihood of , which is expressed as
we first state Sylvester’s determinant theorem.
Lemma A.1 (Sylvester theorem).
If , are matrices of sizes and , respectively, then
where is the identity matrix of order a.
is the identity matrix of order a.
Since follows a variate Gaussian distribution with mean and covariance matrix , we have . In addition, we can derive from the joint probability by integrating out as follows:
where . We then expand the formula and have