Estimating a common covariance matrix for network meta-analysis of gene expression datasets in diffuse large B-cell lymphoma

The estimation of covariance matrices of gene expressions has many applications in cancer systems biology. Many gene expression studies, however, are hampered by low sample size and it has therefore become popular to increase sample size by collecting gene expression data across studies. Motivated by the traditional meta-analysis using random effects models, we present a hierarchical random covariance model and use it for the meta-analysis of gene correlation networks across 11 large-scale gene expression studies of diffuse large B-cell lymphoma (DLBCL). We suggest to use a maximum likelihood estimator for the underlying common covariance matrix and introduce an EM algorithm for estimation. By simulation experiments comparing the estimated covariance matrices by cophenetic correlation and Kullback-Leibler divergence the suggested estimator showed to perform better or not worse than a simple pooled estimator. In a posthoc analysis of the estimated common covariance matrix for the DLBCL data we were able to identify novel biologically meaningful gene correlation networks with eigengenes of prognostic value. In conclusion, the method seems to provide a generally applicable framework for meta-analysis, when multiple features are measured and believed to share a common covariance matrix obscured by study dependent noise.


page 1

page 2

page 3

page 4


Core Shrinkage Covariance Estimation for Matrix-variate Data

A separable covariance model for a random matrix provides a parsimonious...

Estimation of large block covariance matrices: Application to the analysis of gene expression data

Motivated by an application in molecular biology, we propose a novel, ef...

The covariance shift (C-SHIFT) algorithm for normalizing biological data

Omics technologies are powerful tools for analyzing patterns in gene exp...

Covariance-based sample selection for heterogenous data: Applications to gene expression and autism risk gene detection

Risk for autism can be influenced by genetic mutations in hundreds of ge...

SMAGEXP: a galaxy tool suite for transcriptomics data meta-analysis

Bakground: With the proliferation of available microarray and high throu...

Fused inverse-normal method for integrated differential expression analysis of RNA-seq data

Use of next-generation sequencing technologies to transcriptomics (RNA-s...

On a Possible Similarity between Gene and Semantic Networks

In several domains such as linguistics, molecular biology or social scie...

1 Introduction

Genes and their products carry out their functions in concerted effort via intricate networks. Network analysis of genomic data has consequently been an inescapable success by elucidating gene functions and interactions through many different approaches. For that reason, we wished to apply network reconstruction to gene expression data from diffuse large B-cell lymphoma samples from our own lab (Dybkær et al., 2015) as others (Agnelli et al., 2011; Clarke et al., 2013) have successfully used network analysis in other malignancies. However, the methods have one or more of several shortcomings, particularly, as in our case, when the sample size is moderate.

Crudely, many network analysis methods—such as relevance or correlation networks (Horvath, 2011), Gaussian graphical models (Lauritzen, 1996)

, and Bayesian networks—may be categorized according to their use of marginal, partial, or semi-partial correlations. The underlying assumption of them all is that correlation hints at causation. The different measures of correlation separate the methods by different attempts to remove the influence of remaining other genes

(Markowetz and Spang, 2007). As the covariance matrix and its inverse holds this information, its estimation is a central topic to network analysis. However, this fundamental problem of accurately estimating the covariance matrix (or its inverse, the precision matrix) is notoriously difficult. The usual bias-corrected maximum likelihood estimator (MLE), the sample covariance matrix, has long been known to perform poorly in general due to high variability (Dempster, 1972). The sample covariance becomes increasingly ill-conditioned as the number of variables approaches the sample size and singular when exceeds

. Because of its central statistical role, the list of statistical methods and applications utilizing the estimated covariance matrix is exceedingly long. Beside the many standard statistical methods such as principal component analysis (PCA) and discriminant analysis, examples of direct applications include gene and protein network analysis

(Butte et al., 2000; Margolin et al., 2006), spectroscopic imaging (Lin et al., 2007), functional magnetic resonance imaging (fMRI), financial forecasting, and many more. Among this expanding list of applications is also an increasing number of high-dimensional applications and datasets publicly available in online repositories.

In genomic datasets, the number of genes often far exceeds the number of samples . Since the number of parameters increases quadratically in and the sample covariance matrix becomes singular when

many shrinkage and regularization estimators have been proposed to combat the accompanying problems by effectively increasing the degrees of freedom. These examples include the graphical lasso and ridge estimation of the precision matrix

(Meinshausen and Bühlmann, 2006; Friedman, Hastie and Tibshirani, 2008; van Wieringen and Peeters, 2016). Subsequently, focus on the selection of the regularization parameter has received much attention (Fan and Tang, 2013).

Instead of attempting to derive still more sophisticated estimators, we attempt to alleviate the problem from a different angle by using more available data and thus effectively increasing . We do so by utilizing a large number of genomic DLBCL datasets which, as with all major groups of cancer, are now publicly available online. We use these studies in combination with data from our own laboratory as a network meta-analysis to arrive at a better estimate of the covariance matrix whilst accounting for and assessing inter-study variation.

We note, that while the high-dimensional extension to is important it is out of scope in this paper. We restrict ourselves to the case where the total number of samples exceeds . Hence, if classes or datasets are available with sample sizes , we consider the case where while allowing to exceed for each individual class .

The remainder of this paper is structured as follows. In Section 2, we propose the model for a common covariance matrix across multiple studies, derive estimators thereof, and propose an inter-study homogeneity measure to aid in assessing the variation between studies. Next, we apply the model in Section 3 to DLBCL samples across 11 datasets. We then assess the proposed model estimators via synthetic data in Section 4 before concluding the manuscript in Section 5.

Although this work was motivated by gene-gene interaction networks in DLBCL, where the covariance matrix is assumed to contain all information about the conditional dependencies of the genes, the methods are general and not limited to such genomic data.

2 A hierarchical model for the covariance matrix

The model below was inspired by ordinary meta-analysis. Meta-analysis comes in various flavors corresponding to the assumption on the nature of the inter-study treatment effect. Random-effects models (REM) in meta-analysis model the inter-study effects as random variables

(DerSimonian and Laird, 1986; Choi et al., 2003). In a vein similar to the ordinary meta-analysis approach, we think of the different studies as related but perturbed experiments and propose the following simple random covariance model (RCM) of the observations. Let be the number of features and the number of classes. We model an observation from the ’th study as a

-dimensional zero-mean multivariate Gaussian vector with covariance matrix realized from an inverse Wishart distribution, i.e. 

follows the hierarchical model


where denotes a

-dimensional multivariate Gaussian distribution with mean

and positive definite (p.d.) covariance matrix

, and probability density function (pdf) shown in (

A.1), and denotes a -dimensional inverse Wishart distribution with degrees of freedom, a p.d. scale matrix , and pdf shown in (A.2). While the inverse Wishart distribution is defined for all

, the first order moment exists only when

and is given by


Hence, in the RCM of (2.1), can be interpreted as a location-like parameter as it is the expected covariance matrix in each study. The parameter inversely controls the inter-class variation and can as such be considered an inter-class homogeneity parameter of the covariance structure. A large corresponds to high study homogeneity and vice versa for small . This can further be seen as concentrates around for which corresponds to an vanishing inter-study variation for increasing

. This fact is seen directly from variance and covariance expressions for the inverse Wishart (see (

2.9) and (2.10)) where the 4th order denominator grows much faster than the 1st order nominator as polynomials in and causing the variance to vanish for . Thus, the true underlying covariance matrix and the homogeneity parameter are the effects of interest to be estimated in this paper. These basic properties of the RCM motivates the construction.

2.1 The likelihood function

Suppose are i.i.d. observations from independent studies from the model given in (2.1). Let be the matrix of observations for the ’th study where rows correspond to samples and columns to variables. By the independence assumptions, the log-likelihood for and is given by

Throughout this paper, we use the generic notation and for the conditional and unconditional pdf of random variables, respectively. Since the inverse Wishart distribution is conjugate to the multivariate Gaussian distribution, the integral—of which the integrand forms a Gaussian-inverse-Wishart distribution—can be evaluated. Hence can be marginalized out, cf. (A.4) in Appendix A, and we arrive at the following expression for the log-likelihood function,


up to an additive constant where is the multivariate generalization of the gamma function , see (A.3). As should be expected, the scatter matrix and study sample size are sufficient statistics for each study. Note that is conditionally Wishart distributed, , by construction.

As stated in the following two propositions, the likelihood is not log-concave in general. However, it is log-concave as a function of .

Proposition 1 (Non-concavity in ).

For a fixed , the log-likelihood function (2.3) is not concave in .

All proofs have been deferred to Appendix B.

Proposition 2 (Concavity in ).

For a fixed positive definite , the log-likelihood function (2.3) is concave in .

While the likelihood is not concave in we are able to show the existence and uniqueness of a global maximum in .

Proposition 3 (Existence and uniqueness).

The log-likelihood (2.3) has a unique maximum in for fixed and .

This result is proven in Appendix B and follows from two lemmas stated therein.

In the following section estimators of the parameters are derived using moments and the EM algorithm assuming fixed.

2.2 Moment estimator

The pooled empirical covariance matrix can be viewed as a moment estimator of . By the model assumptions, the first and second moment of the ’th observation in the ’th study, , is given by and

for all and . This suggests the estimators


where the latter is obtained by plugging into (2.2). This is the well-known pooled empirical covariance matrix.

2.3 Maximization using the EM algorithm

Here the updating scheme of the expectation-maximization (EM) algorithm

(Dempster, Laird and Rubin, 1977) for fixed is derived. We now compute the expectation step of the EM-algorithm.

From (2.1) we have that,

Let be the precision matrix and let , then we equivalently have that


From the conjugacy of the inverse Wishart and the Wishart distribution, the posterior distribution of the precision matrix is

Hence, by the expectation of the Wishart distribution,

The maximization step, in which the log-likelihood is maximized, yields the estimate which is the mean of the scaled precision matrices . The derivation of this estimate can be seen in Appendix C. Let be the current estimate of . This yields the updating scheme


for . We denote the inverse of the estimate obtained by repeated iteration of (2.6) by .

An approximate maximum likelihood estimator using a first order approximation is also possible. This derivation has been deferred to Appendix D.

2.4 Estimation procedure

We propose a procedure alternating between estimating and while keeping the other fixed. Given parameters and at iteration , we estimate using fixed . Subsequently, we find by a standard one-dimensional numerical optimization procedure using the fixed . This coordinate ascent approach is repeated until convergence as described in Algorithm 1.

2:Sufficient data:
3:Initial parameters:
4:Convergence criterion:
6:Parameter estimates:
7:procedure fitRCM()
8:     Initialize:
9:     for  do
13:         if  then
14:              return
15:         end if
16:     end for
17:end procedure
Algorithm 1 RCM coordinate ascent estimation procedure

The update function in the algorithm is defined by the derived estimators. That is, equations (2.4), (2.6), or (D.2) define as the pooled, EM, or approximate MLE estimates, respectively.

The procedure using the EM step utilizes the results about the RCM log-likelihood and thus provides a guarantee of convergence along with the advantage of a very simple implementation. Both the EM step and the update will always yield an increase in the likelihood. The disadvantage is that the identified stationary point might be a local maximum or saddle-point when considering the log-likelihood function jointly in . Intuitively, the latter possibility happens with zero probability, but it cannot be excluded that the maximum found is not global.

Variations on the convergence criterion can also be considered. One alternative is using the difference in successive parameter estimates. More variants can be created by considering relative rather than absolute differences.

2.5 Interpretation and inference

Test for no class heterogeneity

By the RCM construction parameterizes an inter-class variance where the size of corresponds to the homogeneity between the classes. A large yields high study homogeneity while a small yields low homogeneity. Thus, it might be of interest to test if the estimated homogeneity

is extreme under the null-hypothesis of no heterogeneity (i.e. infinite homogeneity). I.e. a test for the hypothesis

which is equivalent to

The two are equivalent since sampling the covariance matrix from the inverse Wishart distribution becomes deterministic for . Therefore, testing this hypothesis can also be interpreted as testing whether the data is adequately explained when leaving out the hierarchical structure.

The distribution of under the null hypothesis is not tractable. However, in practice under or when is extremely large the estimated will be finite as the intra-study variance dominates the total variance. We note that the null distribution of does not depend on . We propose approximating the distribution of under by resampling. To do this, the model is simply fitted a large number of times on datasets re-sampled under mimicked by permuted class labels to get . As small values of are critical for approximate acceptance regions can be constructed from . Likewise, an approximation of the  value testing can be obtained by

where is the indicator function. The addition of one in the nominator and denominator adds a positive bias to the approximate p-value minimally needed according to Phipson and Smyth (2010). This is approximately the fraction of ’s smaller than .

Intra-class correlation coefficient

We now introduce a descriptive statistic analogous to the intra-class correlation coefficient (ICC)

(Shrout and Fleiss, 1979) well known from ordinary meta-analysis to better determine what constitute large values of . For the RCM, the ICC is given by


This follows from the definition of the ICC which is the ratio of the between-study variation and the total variation of any single pair of variables. The ICC might in this sense be utilized in quantifying the reproducibility of the covariance across studies. Consider observations from (2.1). We temporarily abuse our notation and let

and consider only a single observation . Furthermore, let , , and . To compute the ICC, we are thus interested in the ratio of the quantities and corresponding to the between-study and total variation of the covariance between variables and , respectively. That is, the ICC is the proportion of the total variance between studies,


where the second equality is obtained by and the law of total variation. This equality agrees with the usual ICC as can be interpreted as the (expected) within-study variation. Using the conditional variance given by the needed quantities can be found. To compute an expression for (2.8) we need to consider the fourth-order moments of the observations. From the model, known results of the inverse Wishart distribution, cf. (Cook and Forzani, 2011; von Rosen, 1988), leads to


implying that


Continuing with the expected conditional variance of in the denominator of (2.8),


An expression of in terms of the elements of can then found by substituting (2.9) and (2.10) into (2.11) and by extension an expression for the ICC (2.8) can be obtained. We omit this tedious calculation which can be verified to yield as above. Naturally, the ICC depends only on . A straight-forward plug-in estimator of the ICC of some gene-gene interaction is then .

Though is required for the variances to exist, it is clear that for and for as should be expected.

2.6 Implementation and availability

Algorithm 1 and the different estimators are implemented in the statistical programming language R (R Core Team, 2012) with core functions in C++ using packages Rcpp and RcppArmadillo (Eddelbuettel and François, 2011; François, Eddelbuettel and Bates, 2012). They are incorporated in the open-source R-package correlateR freely available for forking and editing (Bilgrau, 2014a). We refer to the information here for further details and installation instructions. This document was prepared with knitr (Xie, 2013) and LaTeX. To reproduce this document see Acknowledgments or

3 DLBCL meta-analysis

Diffuse large B-cell lymphoma (DLBCL) is an aggressive cancer subtype accounting for of all non-Hodgkin’s lymphomas (NHL) which itself constitutes about of all lymphomas (International Lymphoma Study Group, 1997).

3.1 Data and preprocessing

A large amount of DLBCL gene expression datasets are now available online at the NCBI (National Center for Biotechnology Information) Gene Expression Omnibus (GEO) website. Ten large-scale DLBCL gene expression studies were downloaded and preprocessed using custom brainarray chip definition files (CDF) (Dai et al., 2005) and RMA-normalized using the R-package affy (Gautier et al., 2004). The corresponding GEO-accession numbers are GSE12195, GSE22895, GSE31312, GSE10846, GSE34171, GSE22470, GSE4475, and GSE19246 based on various microarray platforms. The downloaded data together with a dataset from our own laboratory (GSE56315, Dybkær et al. (2015)) yields a total of 2046 samples with study sizes in the range 78-469. The summarization using brainarray CDFs to Ensembl gene identifiers facilitates cross-platform integration.

After RMA normalization and summarization, the data was brought to a common scale by quantile normalizing all data to the common cumulative distribution function of all arrays. Lastly, the datasets were reduced to 11573 common genes represented in all studies and array platforms.

3.2 Analysis

A coexpression network (or weighted correlation network) analysis integrating all datasets was carried out. For each dataset the scatter matrix of the top 300 most variable genes (as measured by the pooled variance across all studies) was computed as the sufficient statistics along with the number of samples. Hence, we investigate 45,150 pairwise interactions.

The parameters of the RCM were estimated using the EM algorithm and yielded the matrix and . The RCM was fitted using three different initial sets of parameters which all converged to the same parameter estimates. Log-likelihood traces, iterations used, and computation times are seen in Figure 1. From the parameter estimate, the common expected covariance was computed and subsequently scaled to the corresponding correlation matrix .

Figure 1: The trace of the log-likelihood for three different starting values of and using the EM algorithm and computational times in minutes. The number of iterations used for each fit is shown above.

The estimated yields a surprisingly low estimated ICC of . Hence by the RCM, only of the variability of the gene-gene covariances is between-studies on average. This low ICC might suggest a high inter-study homogeneity of covariances meaning that the covariances reproduce across studies. However, the selection of only the most varying (across studies) genes is an obvious contribution to the low ICC. By selection, we have high inter-study variability which drives the ICC down. In any case, the low ICC might still suggest high reproducibility of the covariances between studies of the selected genes.

Figure 2:

Top left: dendrograms of the hierarchical clustering, the identified modules, and a heatmap of the correlation matrix. Top right: the correlation network as laid out by the Fruchterman-Reingold algorithm

(Fruchterman and Reingold, 1991). The nodes are colored after the identified modules. The edge colors follow the color key of the heatmap. If the edge weight is numerically less than 0.359, corresponding to the largest values, the edge is suppressed. Bottom: A Hierarchical edge bundling representation of the network (Holten, 2006; Bilgrau, 2014b) where edges loosely follow the dendrogram. Edge colors follow the color key. Only edges with a weight numerically larger than 0.359 are plotted.

Next, we outline one of many possible downstream analyses of the estimated covariance. For simplicity we employed a standard correlation network analysis to the estimated common correlation matrix across all studies. To identify clusters with high internal correlation, we used agglomerative hierarchical clustering with Ward-linkage and distance measure defined as 1 minus the absolute value of the correlation. The tree was arbitrarily pruned at a height which produced 5 modules named by colors. Figure 2 shows these results. A heat-map, the hierarchical tree, and the identified modules are seen at the top left. The top-right shows a graphical representation of the matrix which better illustrates the clusters and their relations. The bottom plot shows the graph radially laid out using hierarchical edge bundling (Holten, 2006) where the edges are guided by the hierarchical tree. Table 1 shows the top genes within each module. As seen e.g. in the olivegreen module, genes from the same gene family cluster together.

To further investigate the low ICC, we refitted the RCM on the subset of the 50 genes in the orchid module. This yielded an

. Subsequently, the model was repeatedly fitted on 50 randomly chosen genes 500 times to gauge the size of the ICC without the selection bias. This resulted in a mean ICC (lower quartile, upper quartile) of

suggesting that there is a high study homogeneity under randomly selected genes.

The test for the null-hypothesis of no study heterogeneity, , is clearly rejected with a  value of 0.002. The mean (sd) of the fitted on 500 permuted datasets was compared to in the observed dataset.

Next, the modules were screened for biological relevance using GO (Gene Ontology) enrichment analysis. The upper right of Figure 2 shows suggested functions of the modules primarily based on the GO analysis. Acknowledgments shows the significant GO-terms at significance level 0.01 for each module in which the most GO-terms appear highly relevant to the pathology of DLBCL.

Gray Olivegreen Orchid Skyblue Coral
n = 159 n = 50 n = 50 n = 31 n = 10
Table 1: The identified modules, their sizes, and member genes. The genes are sorted decreasingly by their intra-module connectivity (sum of the incident edge weights). Only the top 20 genes are shown.

We checked if the identified modules were prognostic for overall survival (OS) in the CHOP and R-CHOP-treated cohort datasets of GSE10846. To do this, the eigengene (Horvath, 2011) for each module was computed and a multiple Cox proportional hazards model for OS was fitted with the module eigengenes as covariates. The module eigengene is the first principal component of the expression matrix of the module which thus can be represented by a linear combination of the module genes. For the prognostic interesting orchid module, the Kaplan-Meier estimates were computed for groups arising when dichotomizing the values of the corresponding eigengene as above or below the median value. These results are seen in Figure 3.

Figure 3: The top row shows and CI for the hazard ratio for each eigengene in the multiple Cox proportional hazards model containing all eigengenes. The bottom row shows Kaplan-Meier estimates (and CI) for the overall survival for patients stratified by the dichotomized orchid eigengene.

From the survival analysis, the olivegreen module appeared particularly interesting since it identifies a group of patients with high value of the corresponding eigengene and poor survival, thus making them candidates for experimental therapies. The gene with highest connectivity in this module is S100A8 (MRP8; calgranulin A) and the gene S100A9 (MRP14; calgranulin B) appears as number 16 further down the connectivity list. This is interesting as compelling research has shown that the S100 family of calcium-binding proteins maintain immunosuppressive MDS cells at the tumor site (Fulmer, 2008). Notably, in mice injected with lymphoma cells, knockout of S100A9 resulted in greater tumor infiltration of T-cells and less accumulation of MDS cells than that seen in wild-type mice (Cheng et al., 2008). The knockout mice had higher rates of tumor rejection and lower tumor size than their wild-type littermates. This result indicates that knockdown of these proteins may improve the outcome of immunotherapy strategies in patients with values of the eigengene of the olivegreen module.

The orchid module also appeared interesting as it marks a gene cluster identifying DLBCL patients with significantly improved outcome. Therefore, a manual screening of the 50 genes within the orchid module was performed. The genes CHI3L1, CHIT1, and LYZ are related to chitin degradation and suggests activated immune system response and inflammation. Enzymes related to chitin degradation can possibly originate from macrophages as CHIT1 is expressed by activated macrophages. The inflammation and modulated activity of the immune system are further suggested by the genes ORM1, PLA2G2D, PLA2G7, and IL18. CHIT3L1 (also known as YKL40) has been linked to the AKT anti-apoptotic signaling pathway in glioblastoma (Francescone et al., 2011) and thus high CHIT3L1 is associated with poor outcome. Some of the remaining, MMP9, PTGDS, ADAMDEC1, HSD11B1, APOC1, and CYP27B1 are involved in metalloproteinase degradation and lipid metabolism. MMP9 in particular is known to have a central role in proliferation, migration, differentiation, angiogenesis, apoptosis, and host defenses. Numerous studies have linked altered MMP expression in different human cancers with poor disease prognosis where up-regulation of MMPs are associated with enhanced cancer cell invasion. ADAMDEC1 is thought to have a central role in dendrite cell functions and their interactions with germinal center T-cells.

The manual screening and GO-analysis results further corroborate that the identified modules are biologically meaningful and that the RCM provides a useful estimate of the covariance.

We finally investigated the choice of using the top 300 most variable genes and fitted the RCM with the top 1000 most variable gene. The approximate MLE algorithm followed by the EM algorithm estimated an matrix and in hours. This implies that ; a somewhat smaller value than the above. The common correlation matrix was computed from and and hierarchical clustering was applied as above. This showed that the same genes in the top 300 based analysis also largely cluster together in this clustering. Particularly, all but one gene from the olivegreen module cluster together.

4 Assessment of the estimation procedures

To assess the precision and stability of the estimation procedure we generated data from the hierarchical model (2.1) under two scenarios with studies with an equal number of observations, , and with different choices of the number of genes . In the first scenario, we set with the number of observations in each study varying in the range . In the second scenario, we set with in the range . In both scenarios, we chose the parameters and for all and for all .

We measured the precision of the estimated values against the expected covariance matrix given by (2.2). Let and be the estimates obtained using the moment, EM, or approximate MLE (defined in Appendix D) approaches as described. We benchmark the proposed estimators against the known truth. The benchmarking measure used is the following weighted sum of squared errors,

In both scenarios, the weighted sum of squared errors for each and estimator was computed for datasets. The mean SSEs (and 99% CI) of these values are seen in panels A and B of Figure 4 as function of .

In scenario 1, the EM approach used a total of 47.7 minutes of computation time for fits compared to 9.3 and 4.6 minutes for the pool and approximate MLE approach, respectively. In scenario 2, the EM, pooled, and approximate MLE approach used 939.9, 8.1, and 149.9 minutes respectively. Panel C of Figure 4 shows the computational time in a third scenario for more extreme values of .

Figure 4: Panel A: The mean SSE (and 99 % CI) for scenario 1 from 1000 simulations as a function of the number of samples in each of 3 studies. Panel B: The mean SSE (and 99 % CI) for scenario 2. Panel C: The mean computation time of fits with varying dimension . All panels show the constant parameters in the scenario.

From both panels A and B of Figure 4, we see that the EM estimation is superior with less variation to that of the approximate MLE and moment estimators. As seen from the panel C and above computational times, this superiority comes at a computational cost rapidly increasing in .

5 Concluding remarks

This article provides a basic framework for modeling a common covariance structure across multiple classes or datasets applied to identify biologically meaningful gene-gene interaction networks in DLBCL. The straight-forward approaches of using the mean or pooled covariance matrix are seen as moment estimators of the model and the estimate using the EM algorithm is shown to be superior to these simple alternatives. While the improvements are modest, the article demonstrates a potentially advantageous way of modelling the inter-study variability by a hierarchical random covariance model. However, the virtue of such a model is not from improvement in accuracy alone. The model-based approach allows investigating and testing both pairwise and global covariance structures correlations of genes, which we consider for future research. Another feature is the invertible estimate provided by the RCM, which allows users to explore the corresponding Gaussian graphical model. Also desirable is the explicit and interpretable quantification of the inter-study variance. If is estimated to be large, the studies exhibit a largely common covariance structure, and vice-versa when is estimated to be small. We have provided the ICC for the RCM as an attempt to aid in the interpretation of . Also provided is the basic framework for testing if study heterogeneity is present. However, the proposed testing is computationally demanding and only feasible when is sufficiently small. This could e.g. be overcome by an improved and faster fitting procedures or by deriving the distribution of under the null hypothesis. Yet the latter is seemingly intractable as is a very complex function of the data. The fact that the null-hypothesis lies on the edge of parameter space also seems to constrain the feasibility of deriving such a distribution.

One might question whether the added utility of the parameter is an obvious relaxation of covariance homogeneity. For example, it is unclear how large a proportion a single extra parameter can explain of the inter-study variance. Hence, the present work should be considered a first step in the direction of explicitly modeling the inter-study variation of covariances.

As demonstrated, combining multiple studies can yield a sufficiently large total sample size that allows for estimation of large covariance matrices without the use of regularization. However, we believe this work could be further enriched by combining the method with regularized estimation. Such a generalization of the model to is extremely interesting though out of scope for this article.

The recent advances in such regularized techniques which allows for analysis of large covariance matrices has unfortunately diminished the focus on collecting an adequate number of samples. The technically possible estimates for extreme

ratios does not necessarily imply that a good estimate is achieved. For example, while non-zero entries can often be accurately recalled in graphical lasso, actual estimates of the covariances (or precisions) can still be heavily biased. Large sample-sizes are still needed to achieve unbiased estimates of the covariance due to the bias-variance trade-off. An increased focus should therefore also appointed to efficiently aggregating datasets and achieving sufficiently large sample sizes to allow for stable and unbiased estimation of covariance matrices. Likewise, results should be validated in independent experiments.


We thank Martin Raussen, Jon Johnsen, as well as Niels Richard Hansen for their assistance on some of the mathematical proofs. The helpful comments from Steffen Falgreen, Andreas S. Pedersen, and reviewers were also much appreciated. The technical assistance from Alexander Schmitz, Julie S. Bødker, Ann-Maria Jensen, Louise H. Madsen, and Helle Høholt is also greatly appreciated.

Supplement A Documents for reproducibility [url] The documents and other needed files to perform the analyses to reproduce this article. See the README file herein.

Supplement B Identified modules and GO analysis [url] Tables of gene module memberships, auxiliary information, and the significant GO-terms for each identified module.


  • Agnelli et al. (2011) [author] Agnelli, LucaL., Forcato, MattiaM., Ferrari, FrancescoF., Tuana, GiacomoG., Todoerti, KatiaK., Walker, Brian aB. a., Morgan, Gareth JG. J., Lombardi, LuigiaL., Bicciato, SilvioS. Neri, AntoninoA. (2011). The reconstruction of transcriptional networks reveals critical genes with implications for clinical outcome of multiple myeloma. Clinical Cancer Research 17 7402–12.
  • Bilgrau (2014a) [author] Bilgrau, Anders EllernA. E. (2014a). correlateR: Fast, efficient, and robust partial correlations R package version 0.1.
  • Bilgrau (2014b) [author] Bilgrau, Anders EllernA. E. (2014b). HierarchicalEdgeBundles: Plotting via Hierarchical Edge Bundles R package version 0.1.
  • Butte et al. (2000) [author] Butte, A JA. J., Tamayo, PP., Slonim, DD., Golub, T RT. R. Kohane, I SI. S. (2000). Discovering Functional Relationships Between RNA Expression and Chemotherapeutic Susceptibility using Relevance Networks. PNAS 97 12182–6. 10.1073/pnas.220392197
  • Cheng et al. (2008) [author] Cheng, PingyanP., Corzo, Cesar AC. A., Luetteke, NoreenN., Yu, BinB., Nagaraj, SrinivasS., Bui, Marylin MM. M., Ortiz, MyrnaM., Nacken, WolfgangW., Sorg, ClemensC., Vogl, ThomasT. et al. (2008). Inhibition of dendritic cell differentiation and accumulation of myeloid-derived suppressor cells in cancer is regulated by S100A9 protein. The Journal of experimental medicine 205 2235–2249.
  • Choi et al. (2003) [author] Choi, J. K.J. K., Yu, U.U., Kim, S.S. Yoo, O. J.O. J. (2003). Combining Multiple Microarray Studies and Modeling Interstudy Variation. Bioinformatics 19 i84–i90. 10.1093/bioinformatics/btg1010
  • Clarke et al. (2013) [author] Clarke, ColinC., Madden, Stephen FS. F., Doolan, PadraigP., Aherne, Sinead TS. T., Joyce, HelenaH., O’Driscoll, LorraineL., Gallagher, William MW. M., Hennessy, Bryan TB. T., Moriarty, MichaelM., Crown, JohnJ., Kennedy, SusanS. Clynes, MartinM. (2013). Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis. Carcinogenesis 34 2300–2308.
  • Cook and Forzani (2011) [author] Cook, R. DennisR. D. Forzani, LilianaL. (2011). On the Mean and Variance of the Generalized Inverse of a Singular Wishart Matrix. Electronic Journal of Statistics 5 146–158. 10.1214/11-EJS602
  • Dai et al. (2005) [author] Dai, ManhongM., Wang, PinglangP., Boyd, Andrew DA. D., Kostov, GeorgiG., Athey, BrianB., Jones, Edward GE. G., Bunney, William EW. E., Myers, Richard MR. M., Speed, Terry PT. P., Akil, HudaH., Watson, Stanley JS. J. Meng, FanF. (2005). Evolving Gene/Transcript Definitions Significantly Alter the Interpretation of GeneChip Data. Nucleic Acids Research 33 e175. 10.1093/nar/gni179
  • Dempster (1972) [author] Dempster, A. P.A. P. (1972). Covariance Selection. Biometrics 28 157–175.
  • Dempster, Laird and Rubin (1977) [author] Dempster, A. P.A. P., Laird, N. M.N. M. Rubin, D. B.D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 39 1–38.
  • DerSimonian and Laird (1986) [author] DerSimonian, RR. Laird, NN. (1986). Meta-analysis in Clinical Trials. Controlled Clinical Trials 7 177–88.
  • Dybkær et al. (2015) [author] Dybkær, KarenK., Bøgsted, MartinM., Falgreen, SteffenS., Bødker, Julie S. J. S., Kjeldsen, Malene K. M. K., Schmitz, AlexanderA., Bilgrau, Anders E. A. E., Xu-Monette, Zijun Y. Z. Y., Li, LingL., Bergkvist, Kim S. K. S., Laursen, Maria B. M. B., Rodrigo-Domingo, MariaM., Marques, Sara C. S. C., Rasmussen, Sophie B. S. B., Nyegaard, MetteM., Gaihede, MichaelM., Møller, Michael B. M. B., Samworth, Richard J. R. J., Shah, Rajen D. R. D., Johansen, PrebenP., El-Galaly, Tarec C. T. C., Young, Ken H. K. H. Johnsen, Hans E. H. E. (2015). A Diffuse Large B-Cell Lymphoma Classification System That Associates Normal B-cell Subset Phenotypes with Prognosis. Journal Of Clinical Oncology, In press. 10.1200/JCO.2014.57.7080
  • Eddelbuettel and François (2011) [author] Eddelbuettel, DirkD. François, RomainR. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40.
  • Fan and Tang (2013) [author] Fan, YingyingY. Tang, Cheng YongC. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 531–552.
  • François, Eddelbuettel and Bates (2012) [author] François, RomainR., Eddelbuettel, DirkD. Bates, DougD. (2012). RcppArmadillo: Rcpp Integration for Armadillo Templated Linear Algebra Library R package version
  • Francescone et al. (2011) [author] Francescone, Ralph AR. A., Scully, SteveS., Faibish, MichaelM., Taylor, Sherry LS. L., Oh, DennisD., Moral, LuisL., Yan, WeiW., Bentley, BrookeB. Shao, RongR. (2011). Role of YKL-40 in the Angiogenesis, Radioresistance, and Progression of Glioblastoma. Journal of Biological Chemistry 286 15332–15343.
  • Friedman, Hastie and Tibshirani (2008) [author] Friedman, JeromeJ., Hastie, TrevorT. Tibshirani, RobertR. (2008). Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics 9 432–41. 10.1093/biostatistics/kxm045
  • Fruchterman and Reingold (1991) [author] Fruchterman, Thomas MJT. M. Reingold, Edward ME. M. (1991). Graph Drawing by Force-Directed Placement. Software: Practice and experience 21 1129–1164.
  • Fulmer (2008) [author] Fulmer, TimT. (2008). Suppressing the suppressors. 38. 10.1038/scibx.2008.914
  • Gautier et al. (2004) [author] Gautier, LaurentL., Cope, LeslieL., Bolstad, Benjamin M.B. M. Irizarry, Rafael A.R. A. (2004). affy—Analysis of Affymetrix GeneChip Data at the Probe Level. Bioinformatics 20 307–315.
  • International Lymphoma Study Group (1997) [author] International Lymphoma Study Group (1997). A Clinical Evaluation of the International Lymphoma Study Group Classification of Non-Hodgkin’s Lymphoma. Blood 89 3909–3918.
  • Holten (2006)

    [author] Holten, DannyD. (2006). Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. Visualization and Computer Graphics, IEEE Transactions on 12 741–748.

  • Horvath (2011) [author] Horvath, SteveS. (2011). Weighted Network Analysis: Applications in Genomics and Systems Biology. Springer.
  • Khalil (2002) [author] Khalil, Hassan K.H. K. (2002). Nonlinear Systems. Prentice Hall.
  • Lauritzen (1996) [author] Lauritzen, Steffen LS. L. (1996). Graphical models. Clarendon Press.
  • Lin et al. (2007) [author] Lin, Fa-HsuanF.-H., Tsai, Shang-YuehS.-Y., Otazo, RicardoR., Caprihan, ArvindA., Wald, Lawrence LL. L., Belliveau, John WJ. W. Posse, StefanS. (2007). Sensitivity-Encoded (SENSE) Proton Echo-Planar Spectroscopic Imaging (PEPSI) in the Human Brain. Magnetic Resonance in Medicine 57 249–57. 10.1002/mrm.21119
  • Margolin et al. (2006) [author] Margolin, Adam aA. a., Nemenman, IlyaI., Basso, KatiaK., Wiggins, ChrisC., Stolovitzky, GustavoG., Dalla Favera, RiccardoR. Califano, AndreaA. (2006). ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7 Suppl 1 S7.
  • Markowetz and Spang (2007) [author] Markowetz, FlorianF. Spang, RainerR. (2007). Inferring Cellular Networks - A Review. BMC Bioinformatics 8 S5. 10.1186/1471-2105-8-S6-S5
  • Meinshausen and Bühlmann (2006) [author] Meinshausen, NN. Bühlmann, PP. (2006). High dimensional graphs and variable selection with the lasso. The Annals of Statistics.
  • Petersen and Pedersen (2008) [author] Petersen, KBK. Pedersen, MSM. (2008). The Matrix Cookbook Technical University of Denmark, Technical Manual.
  • Phipson and Smyth (2010) [author] Phipson, BelindaB. Smyth, Gordon KG. K. (2010). Permutation P-values Should Never be Zero: Calculating Exact P-values when Permutations are Randomly Drawn. Statistical Applications in Genetics and Molecular Biology 9.
  • Shrout and Fleiss (1979) [author] Shrout, Patrick EP. E. Fleiss, Joseph LJ. L. (1979). Intraclass Correlations: Uses in Assessing Rater Reliability. Psychological Bulletin 86 420.
  • R Core Team (2012) [author] R Core Team (2012). R: A Language and Environment for Statistical Computing R Foundation for Statistical Computing, Vienna, Austria ISBN 3-900051-07-0.
  • van Wieringen and Peeters (2016)

    [author] van Wieringen, Wessel N.W. N. Peeters, Carel F. W.C. F. W. (2016). Ridge Estimation of Inverse Covariance Matrices from High-Dimensional Data. Computational Statistics and Data Analysis 103 284–303. 10.1016/j.csda.2016.05.012

  • von Rosen (1988) [author] von Rosen, DD. (1988). Moments for the Inverted Wishart Distribution. Scandinavian Journal of Statistics 15 97–109.
  • Xie (2013) [author] Xie, YihuiY. (2013). Dynamic Documents with R and knitr. CRC Press.

Appendix A Marginalization of the covariance

This section shows the marginalization over in (2.3). Recall the model (2.1) where denotes a -dimensional multivariate Gaussian distribution with mean and positive definite (p.d.) covariance matrix with probability density function (pdf)


and where denotes a -dimensional inverse Wishart distribution with degrees of freedom, a p.d. scale matrix , and pdf


where is p.d. and is the multivariate generalization of the gamma function given by


For ease of notation we drop the subscript on , , , and . By the model assumptions,

The integrand can be recognized as a unnormalized inverse Wishart pdf of the distribution , and so the integral evaluates to the reciprocal value of the normalizing constant in that density. Thus,


Using the matrix determinant lemma and , this can be further simplified to

which can help to speed-up computations.

Appendix B Proofs

b.1 Non-concavity of the log-likelihood

The likelihood function is not log-concave in general. This section analyses the (non)-concavity of the log-likelihood function given in (2.3). More precisely, the following two propositions are proved.

See 1

See 2

Proof of Proposition 1.

Assume is fixed and consider only the terms involving in (2.3). We reduce to the one-dimensional case where

which implies

It is straightforward to show there exists a value for , and for which . Since the second derivative is not always negative the log-likelihood is not log-concave. ∎

Proof of Proposition 2.

Consider the terms involving . Clearly, the mixed terms involving both and are log-linear in and hence log-concave. We thus restrict our attention to the remaining terms not dependent on . The sum of these terms are concave in , since

which can be seen to be concave since for all and is concave for all and . The concavity of is easily seen by the fact that where is the tri-gamma function. The tri-gamma function is a well-known monotonically decreasing function. Hence, the likelihood is log-concave in . ∎

b.2 Existence and uniqueness of likelihood maxima

This section proves Lemmas 1 and 2 which imply Proposition 3.

Before we state the lemmas, the proposition, and their proofs, we see that the reparameterisation of the RCM is irrelevant. Consider the log-likelihood in (2.3) assuming fixed. The log-likelihood obey


Notice, that this equation also holds in the reparameterization. Here we have

Since is only dependent on data (when is fixed) we can set . Without loss of generality we can therefore consider (B.1) in the following.

See 3

Proof of Proposition 3.

We first prove existence of the maximum. Note, that we may consider as a function on a vector space by letting where is a symmetric matrix. By Lemma 1 and the continuity of , the set is bounded and closed and thus compact for any . The existence of a maximum follows from the extreme value theorem by the continuity of . A stationary point exists due to Rolle’s theorem and the differentiability of .

Next, we show the uniqueness of the maximum. Let denote the set of stationary points, which is nonempty. By Lemma 1, has a finite upper bound given by the maximum of the log-likelihood in those points. All gradient curves (that is, solution curves to ) must then converge toward exactly one of the stationary points where monotonically increases along each curve. Define for the basin of attraction

The basin of attraction is open if is a maximum (Khalil, 2002, Lemma 4.1). By Lemma 2, is always a maximum and hence all are open sets in the set of all positive definite matrices . This partitions the space into disjoint, non-empty, open sets. Since is connected, this is only possible if and thus there is only a single basin of attraction and maximum of . ∎

Lemma 1.

If there exists an eigenvalue

of such that or , then for fixed and .

Proof of Lemma 1.

Assume the hypothesis of the lemma and consider the expression given in (B.1) up to the addition of a constant. The likelihood obey the following two upper bounds. First,

Secondly, let whereby (B.1) can be expressed as

Since is concave and the above sum is a convex combination, we have


where and . Three cases now exists: 1) If , then

2) If , then

as the matrix in the second term is almost surely positive definite when and the log determinant is some constant. 3) If is bounded and the largest eigenvalue (and hence , then and is bounded away from zero. Therefore,