The analysis of high-dimensional data is important in many scientific areas, and often poses the challenge of the availability of a relatively small number of cases versus a large number of unknown parameters. It has been documented both practically and theoretically that under the assumption of sparsity of the underlying model, larger effects or dependencies can be inferred even in the very high-dimensional case[26, 22]
. Still in many cases conclusions can be much improved by incorporating prior knowledge in the analysis, or by “borrowing information” by simultaneously analysing multiple related datasets. In this paper we introduce a methodology that achieves both, and that is at the same time scalable to large datasets in its computational complexity. It is based on an empirical Bayesian setup, where external information is incorporated through the prior, and information is borrowed across similar analyses by empirical Bayes estimation of hyperparameters. Sparsity is induced through utilisation of the horseshoe prior, and computational efficiency through novel variational Bayes approximations to the posterior distribution. We illustrate the methodology by two applications in genomics: network reconstruction and eQTL mapping, but the proposed framework should be useful also for analysing other large-scale data sets with complex dependence structures.
Our working model is a collection of linear regression models, indexed by, corresponding to characteristics (e.g. genes). For each characteristic we have measurements on individuals, labelled , consisting of a univariate response
and a vectorof explanatory variables. We collect the responses on characteristic in the -vector and similarly collect the explanatory variables in the -matrix , having rows , and adopt the regression models
Here the regression coefficients form a vector in , and the error vectors ’s are unobserved. The dimension of the regression parameter may be different for different characteristics .
Our full set of observations consists of the pairs , whose stochastic dependence will not be used and hence need not be modelled. In addition to these regression pairs we assume available prior information on the vectors in the form of a 2-dimensional array , whose th row presents a grouping of the coordinates of into groups, indexed by : the value is the index of the group to which the th coordinate of belongs. (Because the may have different lengths, is a possibly “ragged array” and not a matrix.) The information in is considered to be soft in that coordinates of that are assigned to the same group are thought to be similar in size, but not necessarily equal. The information may for instance come from a previous analysis of similar data, or be taken from a genomic database.
We wish to analyse this data, satisfying four aims:
Borrow information across the characteristics by linking the analyses of the models (1) for different .
Incorporate the prior information in a soft manner so that it informs the analysis if correct, but can be overruled if completely incompatible with the data.
Allow for sparsity of the explanatory models, i.e. focus the estimation towards parameter vectors with only a small number of significant coefficients, enabling analysis for small relative to and/or .
Achieve computational efficiency, enabling analysis with large and/or .
To this purpose we model the parameters and the scales of the error vectors through a prior, and next perform empirical Bayesian inference. This analysis is informed by the model (1) and the following hierarchy of a generating model (referred to as pInc later on) for the errors and a prior model for :
Here N is a (multivariate) normal distribution,is the
denotes the standard Cauchy distribution restricted to the positive real axis, and
denotes the gamma distribution with shape and rate parametersand . As usual the hierarchy should be read from bottom to top, where dependencies of distributions on variables at lower levels are indicated by conditioning, and absence of these variables in the conditioning should be understood as the assumption of conditional independence on variables at lower levels of the hierarchy. The specification (2) gives the model for the th characteristic. The models for different are linked by assuming the same values of the hyperparameters for all . These hyperparameters will be estimated from the combined data
by the empirical Bayes method, thus borrowing strength across responses and achieving the first of the four aims, as listed previously.
We also consider a variant of the model (later referred to as pInc2) in which the last line of the hierarchy is dropped and the parameters are pooled into a single parameter per group (). The parameters are then estimated by empirical Bayes on the data pooled over . In some of the simulations this model outperformed (2).
The th row of gives a grouping of the coordinates of into groups. The scheme (2) attaches a latent variable to each group, for , whose squares possess inverse gamma distributions, independently across groups. These latent variables enter the prior distributions of the coordinates of , which marginally given are scale mixtures of the normal distribution. Choosing the scale parameters from the half-Cauchy distribution gives the so-called horseshoe prior [10, 11]. This may be viewed as a continuous alternative to the traditional spike-and-slab prior, which is a mixture of a Dirac measure at zero and a widely spread second component, and is widely used as a prior that induces sparsity.
The horseshoe density with scale is the mixture of the univariate normal distributions relative to the parameter . It combines an infinite peak at zero with heavy tails, and is able to either shrink parameters to near zero or estimate them unbiasedly, much as an improper flat prior. The relative weights of the two effects are moderated by the value of . In the model (2) the coordinates of corresponding to the same group receive a common parameter , and are thus either jointly shrunk to zero or left free, depending on the value of . This allows to achieve the aims two and three as listed previously. Theoretical work in [62, 63, 64, 11, 13] (in a simpler model) suggests an interpretation of as approximately the fraction of nonzero coordinates in the th group, and corroborates the interpretation of as a sparsity parameter. In model (2) this number is implicitly set by the data, based on the inverse gamma prior on . Requiring the hyperparameters of these gamma distributions to be the same across the characteristics induces the borrowing of information between the characteristics , in particular with respect to the sparsity of the vectors .
Model (2) chooses the squares of the scales
of the error variables from an inverse gamma distribution, which is the usual conjugate prior. The priors on the regression parametersare also scaled by , thus giving them a priori the same order of magnitude. This seems generally preferable.
The Bayesian model described by (1) and (2) leads to a posterior distribution of in the usual way, but this depends on the hyperparameters . In Section 4.2 we introduce a method to estimate these hyperparameters from the full data , and next base further inference on the posterior distributions of the parameters evaluated at the plugged-in estimates of the hyperparameters. Because the prior on the coefficients is continuous, the posterior distribution does not provide automatic model (or variable) selection, which is a disadvantage of the horseshoe prior relative to the spike-and-slab priors. To overcome this, we develop a way of testing for nonzero regression coefficients based on the marginal posterior distributions of the in Section 4.3.
The horseshoe prior has gained popularity, mainly due to its computational advantage over spike-and-slab priors. However, in our high-dimensional setting the approximation of the posterior distribution by an MCMC scheme turns out to be still a computational bottleneck. The algorithm studied by , which can be applied in the special case of a single group () has complexity for a single regression (i.e. ) per MCMC iteration. We show in Section 5.2 that this is too slow to be feasible in our setting. For this reason we develop in Section 4.1 a variational Bayesian (VB) scheme to approximate the posterior distribution, in order to satisfy the fourth aim in our list.
The variational Bayesian method consists of approximating the posterior distribution by a distribution of simpler form, which is chosen as a compromise between computational tractability and accuracy of approximation. The quality of the approximation is typically measured by the Kullback-Leibler divergence. Early applications involved standard distributions such as Gaussian, Dirichlet, Laplace and extreme value models [3, 46, 1, 2, 67]. In the present paper we use nonparametric approximations, restricted only by the assumption that the various parameters are (block) independent. (This may be referred to as mean-field variational Bayes, although this term appears to be used more often for independence of all univariate marginals, whereas we use block independence.) In this case the variational posterior approximation can be calculated by iteratively updating the marginal distributions [6, 49]. Variational Bayes typically produces accurate approximations to posterior means, but have been observed to underestimate posterior spread [20, 7, 9, 45, 68, 59, 70, 69]. We find that in our setting the approximations agree reasonably well to MCMC approximations of the marginals, although the latter take much longer to compute.
The model (1)-(2) may be useful for data integration in a variety of scientific setups, and for data sources as diverse as gene expression, copy number variations, single nucleotide polymorphisms, functional magnetic resonance imaging, or social media data. The external information incorporated in the array may reflect data of a different type, and/or of a different stage of research, and the simultaneous analysis of different characteristics allows further data integration. For example, in genetic association studies data from multiple stages can help the identification of true associations [52, 23, 27]. In this paper we consider applications to gene regulatory networks and to eQLT mapping, which we describe in the next two sections, before developing the general algorithms for models (1) and (2).
The remainder of the paper is organised as follows. In Section 4.1 we develop a variational Bayes approach to approximate the posterior distributions of the regression parameters for given hyperparameters, and show this to be comparable in accuracy to Gibbs sampling in Section 5.2, although computationally much more efficient. In Section 4.2 we develop the Empirical Bayes (EB) approach for estimating the hyperparameters, and in Section 4.3 we present a threshold based-procedure for selecting nonzero regression coefficients based on the marginal posterior distributions of the . We show in Section 5 by means of model-based simulations that the proposed approach performs better, in terms of both average -error and average ROC curves, than its ridge counterpart in the framework of network reconstruction. The potential of our approach is shown on real data in Section 6 both in gene regulatory network reconstruction and in eQTL mapping. Section 7 concludes the paper.
2 Network reconstruction
The identification of gene regulatory networks is crucial for understanding gene function, and hence important for both treatment and prediction of diseases. Prior knowledge on a given network is often available in the literature, from repositories or pilot studies, and combining this with the data at hand can significantly improve the accuracy of reconstruction .
A Gaussian graphical model readily gives rise to a special case of the model (1)-(2). In such a model the data concerning genes measured in a single individual (e.g. tissue) is assumed to form a multivariate Gaussian -vector, and the network of interest is the corresponding conditional independence graph . The nodes of this graph are the genes and correspond to the coordinates of the Gaussian vector. Two nodes/genes are connected by an edge in the graph if the corresponding coordinates are not conditionally independent given the other coordinates. It is well known that this is equivalent to the corresponding element in the precision matrix of the Gaussian vector being nonzero .
Assume that we observe a gene vector for individuals, giving rise to independent copies of -vectors satisfying
Here is the precision matrix; its inverse is the covariance matrix of the vector and is assumed to be positive-definite. The Gaussian graphical model consists of a graph with nodes and with edges given by the nonzero elements of the precision matrix. Hence to reconstruct the conditional independence graph it suffices to determine the non-zero elements of the latter matrix.
We relate this to the notation used in the introduction by writing , and next collecting the observations per gene , giving the -vector , for . We next define
as the -matrix with columns , for . It is well known that the residual when regressing a single coordinate of a multivariate Gaussian vector linearly on the other coordinates , for , is Gaussian. Furthermore, the regression coefficients can be expressed in the precision matrix of as
This shows that (1) holds with and a multivariate normal error vector
with varianceequal to the residual variance. Moreover, the (non)zero entries in the th row vector of the precision matrix correspond to the (non)zero coordinates of . Consequently, the problem of identifying the Gaussian graphical model can be cast as a variable selection problem in the regression models (1).
This approach of recasting the estimation of the (support of the) precision matrix as a collection of regression problems was introduced by 
, who employed Lasso regression[58, 17] to estimate the parameters. Other variable selection methods can be employed as well . A Bayesian approach with Gaussian, ridge-type priors on the regression coefficients was developed in , and extended in  to incorporate prior knowledge on the conditional independence graph. A disadvantage of the Gaussian priors employed in these papers is that they are not able to selectively shrink parameters, but shrink them jointly towards zero (although prior information used in  alleviates this by making this dependent on prior group). This is similar to the shrinkage effect of the ridge penalty  relative to the Lasso, which can shrink some of the precision matrix elements to exactly zero, and hence possesses intrinsic model selection properties. The novelty of the present paper is to introduce the horseshoe prior in order to better model the sparsity of the network.
We assume that the prior knowledge on the to-be-reconstructed network is available as a “prior network”, which specifies which edges (conditional independencies) are likely present or absent. This information can be coded in an adjacency matrix P, whose entries take the values 0 or 1 corresponding to the absence and presence of an edge: if variable is connected with variable and otherwise. Thus in this example we only have two groups, i.e. .
The advantage of reducing the network model to structural equation models of the type (1) is computational efficiency. An alternative would be to model the precision matrix directly through a prior. This would typically consist of a prior on the graph structure, followed by a specification of the numerical values of the precision matrix given its set of nonzero coefficients. The space of graphs is typically restricted to e.g. decomposable graphs, forests, or trees [21, 15, 31]. The posterior distribution of the graph structure can then be used as the basis of inference on the network topology. However, except in very small problems, the computational burden is prohibitive.
3 eQTL mapping
In eQTL mapping the expression of a gene is taken as a quantitative trait, and it is desired to identify the genomic loci that influence it, much as in a classical study of quantitative trait loci (QTL) of a general phenotype. Typically one measures the expression of many genes simultaneously and tries to map these to their QTL. Since gene expression levels are related to disease susceptibility, elucidating these eQLT (expression QTL) may give important insights into the genetic underpinnings of complex traits. We shall identify genetic loci here with single nucleotide polymorphisms (SNPs), but other biomarkers can be substituted.
Early works by [12, 57, 76] considered every gene separately for association. However, many genes are believed to be co-regulated and to share a common genetic basis [51, 74]. In addition, SNPs with pleiotropic effects may be more easily identified by considering multiple genes together. Therefore following [56, 41, 32], we focus on a joint analysis, borrowing information across genes. We regress the expression of a given gene on SNPs both within and around the gene, where our model is informed about the SNP location. The sparse parametrization offered by our model is suitable, as most genetic variants are thought to have a negligible (if any) differential effect on expression.
Suppose we collect the (standardized) expression levels of genes over individuals, and identify for each gene a collection of SNPs to be investigated for association. For instance, the latter collections might contain all SNPs in a relatively large window around the gene, some of which falling inside the gene and some outside. For each individual and SNP we ascertain the number of minor alleles (, or ), and change all 2’s to 1’s. Because there are not many 2’s in the data this does not reduce the information while it simplifies the modelling. We use these numbers to form the -matrix . Let be the -vector of expression levels for gene , and assume the linear model (1).
It is believed that SNPs that occur within a gene may play a more direct role in the gene’s function than SNPs at other genomic locations [55, 42]. Therefore, it is natural to treat SNPs falling within a given gene differently than the ones not falling within that gene. This gives rise to two groups of SNPs for a given gene, which we can encode as prior knowledge in a -dimensional array P with values and .
4 Posterior inference
In this section we discuss statistical inference for the model (1)-(2). This consists of three steps: the approximation to the posterior distribution of the model for given hyperparameters (and given ), the estimation of the hyperparameters (across ), and finally a method of variable selection.
4.1 Variational Bayes approximation
The variational Bayes approximation to a distribution is simply the closest element in a given target set of distributions, usually with “distance” measured by Kullback-Leibler divergence . In our situation we wish to approximate the posterior distribution of the parameter given in the model (1)-(2), for a fixed . Here we take the regression matrix as given.
Thus the variational Bayes approximation is given as the density that minimizes over ,
where is the posterior density, the expectation is taken with respect to having the density , and and are the joint density of and the marginal density of , respectively, in the model (1)-(2), with prior density on . As the marginal density is free of , minimization of this expression is equivalent to maximization of the second term
By the non-negativity of the Kullback-Leibler divergence, this expression is a lower bound on the logarithm of the marginal density of the observation. For this reason it is usually referred to as “the lower bound”, or “ELBO”, and solving the variational problem is equivalent to maximizing this lower bound.
The set is chosen as a compromise between computational tractability and accuracy of approximation. Restricting to distributions for which all marginals of are independent is known as mean-field variational Bayes, or also as the “naïve factorization” . Here we shall use the larger set of distributions under which the blocks of , , and
-parameters are independent. Thus we optimize over probability densitiesof the form
There is no explicit solution to this optimization problem. However, if all marginal factors but a single one in the factorization are fixed, then the latter factor can be characterised easily, using the non-negativity of the Kullback-Leibler divergence. This leads to an iterative algorithm, in which the factors are updated in turn.
In the Supplementary Material (SM) we show that in our case the iterations take the form:
is the distribution with probability density function proportional to
and the parameters on the right hand side satisfy
In these expressions, is the number of ’s in the -th row of the 2-dimensional array P encoding the groups, ; and is the vector obtained from by replacing the coordinates not corresponding to group by .
The expected value of , which appears in the expression of , , and above, is given in the following lemma.
The norming constant for is and the expectation of if is given by
where is the exponential integral function of order 1, defined by
This follows by easy manipulation and the standard density transform formula. ∎
In addition, the variational lower bound (4) on the log marginal likelihood at takes the form (See SM for details)
4.2 Global Empirical Bayes
Model (2) possesses the pairs of hyperparameters . The pair controls the prior of the error variances ; we fix this to numerical values that render a vague prior, e.g. to . In contrast, we let the values of the parameters be determined by the data. As these hyperparameters are the same in every regression model , this allows information to be borrowed across the regression equations, leading to global shrinkage of the regression parameters. The approach is similar to the one in .
Precisely, we consider the criterion
The maximization of the function on the right with respect to for fixed leads to the variational estimator considered in Section 4.1 (which depends on ). Rather than running the iterations (5) for computing this estimator to “convergence”, next inserting in the preceding display (7), and finally maximizing the resulting expression with respect to , we blend iterations to find and as follows. Given an iterate of (5) we set in (7) equal to and find its maximizer with respect to . Next given we set (in the display following (5) equal to and use (5) to find a next iterate of . We repeat these alternations to “convergence”.
For fixed the far right side in the second row of the preceding display depends on only through
Using the approximation , where is the digamma function, the maximization yields (see SM for details)
where . The following algorithm summarizes the above described procedure.
|Variational algorithm with sparse local-global shrinkage priors|
|, and , , ,|
|2: while and M do|
|E-step: Update variational parameters|
|3: for to update|
|, , , , and ; and in that order|
|M-step: Update hyperparameters|
|4: , ;|
|6: end while|
4.3 Variable selection
Because the horseshoe prior is continuous, the resulting posterior distribution does not set parameters exactly equal to zero, and hence variable selection requires an additional step. We investigated two schemes that both take the marginal posterior distributions of the parameters as input.
A natural method is to set a parameter
equal to zero (i.e. remove the corresponding independent variable from the regression model) if the point 0 is not in the tails of its marginal posterior distribution, or more precisely, if 0 does belong to a central marginal credible interval for the parameter. Given that our variational Bayes scheme produces conditional Gaussian distributions, this is also equivalent to the absolute ratio of posterior mean and standard deviation
not exceeding some threshold. (In the network setup of Section 2 we use the symmetrized quantity , as the two constituents of the average refer to the same parameter.)
To determine a suitable cutoff or credible level we applied the variational Bayes procedure of Section 4.1 with all credible levels on a grid with step size 5% within the range , resulting in a model, or set of ‘nonzero’ parameters , for every . We allow rather lenient credible levels because the model might benefit from the inclusion of fewer variables, in particular when strong collinearity is present. We next refitted the model (1)-(2) with the non-selected parameters set equal to 0, evaluated the variational Bayes lower bound on the likelihood (4) (equivalently (6)), and chose the value of that maximized this likelihood and the corresponding model. When refitting we did not estimate the hyperparameters (’s and ’s for pInc, ’s for pInc2, as explained in Section 4.2), but used the values resulting from the entire data set. Even though this procedure sounds involved, it is computationally fast, because it is free of the empirical Bayes step and typically needs to evaluate only models with few predictors.
4.3.2 An alternative selection scheme
As an alternative selection scheme we investigated the decoupled shrinkage and selection (DSS) criterion proposed by . For each regression model , given the posterior mean vector determined by the pooled procedure of Sections 4.1-4.2, this calculates the adaptive lasso type estimate
and next chooses the model corresponding to the nonzero coordinates of . The authors  advocate this method over thresholding, in particular because it may better handle multi-collinearity. In genomics applications, such as the eQTL Example (Section 6.2), multi-collinearity is likely strong, in particular between neighbouring genomic locations. Another attractive aspect of (9) is that it only relies on the posterior means, which we have shown to be accurately estimated by the variational Bayes approximation.
In the DSS approach the thresholding in order to obtain models of different sizes is performed through the smoothing parameters . The authors 
propose a heuristic to choosebased on the credible interval of the explained variation. An alternative is to apply -fold cross-validation based on the squared prediction error:
where superscript refers to the observations used as test sample in fold , and to the complementary training sample used to calculate , by (9) with and replacing and . Again we throughout fix the hyperparameters of the priors to the ones resulting from the variational Bayes algorithm on the entire data set. We have found that the function can be flat, which, to some extent, is a ‘by-product’ of the strong shrinkage properties of horseshoe prior. (Given a sparse true vector, many posterior means will be close to zero, which renders the DSS solution (9) less dependent on .) To overcome this, and because we prefer sparser models, we used the maximum value of
for which the MSE is within 1 standard error of the minimum of the mean square errors.
In the next sections, if not specified, selection should be understood as the first scheme based on thresholding.
We performed model-based simulations to compare model (2), referred to as pInc, with the alternative method pInc2, in which there is only one parameter per group, and their ridge counterpart ShrinkNet (). We refer to the latter paper for comparisons of ShrinkNet to other competing methods. ShrinkNet was indeed shown in  to outperform the graphical lasso , the SEM Lasso  and the GeneNet  using exactly the same simulated data below. As ShrinkNet was developed for network reconstruction only and does not incorporate prior knowledge, we initially considered the setup of network reconstruction in Section 2 and set in (2). Next we compared pInc and pInc2 in the same network recovery context, but incorporating prior information. Finally, we compared the accuracy and computing time of our variational Bayes approximation approach with Gibbs sampling-based strategies .
5.1 Model-based simulation
We generated data according to (3), for and to reflect high and low-dimensional designs. We generated precision matrices corresponding to band, cluster and hub network topologies [75, 39] from a G-Wishart distribution  with scale matrix equal to the identity and degrees of freedom.
The performance of the methods was investigated using average errors and across replicates of the experiment. Here (or ) is the vector consisting of all nonzero (or zero) values of the partial correlation matrix except the diagonal elements, and (or ) is the vector consisting of the corresponding posterior means.
The results are displayed in Tables 1 and 2. Both methods pInc and pInc2 outperform ShrinkNet in all simulation setups. For the nonzero parameters (‘signals’) pInc and pInc2 are on par, but for the zero parameters pInc outperforms pInc2 for small in the Band and Cluster topologies, but when increases and in the Hub topology this turns around.
Somewhat worrisome is that the performance of all methods on the zero parameters initially seems to suffer from increasing sample size . The empirical Bayes choice of shrinkage level clearly favours strong shrinkage for small , giving good performance on the zero parameters, but relaxes this when the sample size increases. Thus the better performance for increasing on the nonzero parameters is partly offset by a decline in performance on the zero parameters. This balance between zero and nonzero parameters is restored only for relatively large sample sizes. A similar phenomenon was observed in .
Tables 3 and 4 compare the performance of pInc and pInc2 when prior information is available (both with sample size ). The prior information consists either of the correct adjacency matrix for the network (i.e. if and otherwise), or an adjacency matrix in which 50 % of the positive entries are correct. The latter matrix was obtained by swapping a random selection of half the 1s in the correct adjacency matrix with a random selection of equally many 0s. The tables shows that pInc usually outperforms pInc2, the zero parameters in the Hub case with true edge prior knowledge being the only significant exception.
|Quality of prior Info||pInc2||pInc|
|50% true edge info||6.66||5.30|
|50% true edge info||3.25||3.28|
|50% true edge info||0.46||5.88|
|Quality of prior Info||pInc2||pInc|
|50% true edge info||219.57||217.39|
|50% true edge info||286.98||286.73|
|50% true edge info||37.79||34.60|
To study the performance of the different methods on model selection we computed ROC curves, showing the true positive rate (TPR) and false positive rate (FPR) as a function of the threshold on the test statistic (8) for inclusion of a parameter in the model. Figure 1 shows that in the absence of prior information pInc2 performs best, closely followed by pInc, and both methods outperform ShrinkNet. Given either correct or 50% correct information pInc is the winner, as seen in Figure 2, which also shows the usefulness of incorporating prior information. These findings are consistent with the results on estimation presented in Tables 1–4 in their ordering of pInc above pInc2 in the case of availability of external information.
Figure 3 in the supplementary material displays histograms of the EB estimates of prior parameter/hyperparameter ’s by pInc (TauSq) and pInc2 (TauSq2) across the simulation replicates. The initial hyperparameter value for pInc2 was set to . The figure shows that the estimated parameters are bigger (hence less shrinkage) when the sample size is larger. Furthermore, for a fixed sample size the estimates are reasonably stable, the quotient of the largest and smallest across the 50 replicates being below a small constant.
5.2 Variational Bayes vs MCMC
We investigated the quality of the variational approximation by comparing it to the output of a long MCMC run. As we only use the univariate marginal posterior distributions of the regression parameters for model selection, we focused on these. We ran a simulation study with a single regression equation (say ) with , and compared the variational Bayes estimates of the marginal densities with the corresponding MCMC-based estimates. We sampled independent replicates from a -dimensional normal distribution with mean zero and -precision matrix , and formed the vector and matrix as indicated in Section 2. The precision matrix was chosen to be a band matrix with lower and upper bandwidths equal to 4, thus a band of total width 9. For both the variational approximations and the MCMC method we used prior hyperparameters and prior hyperparameters (resp. for pInc2) fixed to the values set by the global empirical Bayes method described in Section 4.2. The MCMC iterations were run times without thinning, after which the first iterates were discarded . Tables 5 and 6 summarize the comparison.
The correspondence between the two methods is remarkably good. The posterior means obtained from the variational method are even slightly better as estimates of the true parameters than the ones from the MCMC method, in terms of -loss. With respect to computing time the variational method was vastly superior to the MCMC method, which would hardly be feasible even for .
in 20 replications ()
computing time needed
for all the 100 regressions
|MCMC method||2.22||13h 15 min|
in 20 replications ()
computing time needed
for all the 100 regressions
|pInc2||2.25||1 min 48 sec|
|MCMC method||3.03||13h 19 min|
We applied the methods to two real datasets, both as illustration.
6.1 Reconstruction of the apoptosis pathway
The cells of multicellular organisms possess the ability to die by a process called programmed cell death or apoptosis, which contributes to maintaining tissue homeostasis. Defects in the apoptosis-inducing pathways can eventually lead to expansion of a population of neoplastic cells and cancer [24, 35, 30]. Resistance to apoptosis may increase the escape of tumour cells from surveillance by the immune system. Since chemotherapy and irradiation act primarily by inducing apoptosis, defects in the apoptotic pathway can make cancer cells resistant to therapy. For this reason resistance to apoptosis remains an important clinical problem.
In this section we illustrate the power of our method in reconstructing the apoptosis network from lung cancer data  from the Gene Expression Omnibus (GEO). The data comprises genes, consisting of observations from normal tissue and observations from tumor tissue, hence observations in total. We fitted pInc on the tumor data, using the data on normal tissue as prior knowledge. To the latter aim we fitted pInc to the normal data with a single group , and applied the model selection procedure of Section 4.3.1 to create an array of incidences, which served as input when fitting pInc on the tumor data. The idea is that, while tumors and normal tissue may differ strongly in terms of mean gene expression, the gene-gene interaction network may be relatively more stable.
When fitting the pInc model with the two groups (gene interaction absent or present in normal tissue), we observed a huge difference in the empirical Bayes estimates of the hyperparameters governing the priors of the parameters of the two groups, namely prior mean for absent and for present in the prior network. This strongly indicates the relevance of the prior knowledge , so that superior performance of pInc in the reconstruction can be expected.
Figure 3 displays the reconstructed undirected network by pInc with Selection procedure 4.3.1. A total number of 27 edges were found with various edge strengths. The ten most significant edges in decreasing order were: PRKACG FASLG, MYD88 CSF2RB, PIK3R2 CHUK, TNFRSF10B CHP1, PRKAR1B AKT2, PIK3R2 NGF, TRAF2 BAX, TNF IL1B, PRKAR2B AKT3, and TRAF2 PIK3R2.
Node degrees varied from 0 to 4 with PIK3R2 and PRKAR1A yielding the highest degree 4, followed by TRAF2 having degree 3, and CHUK, CHP1, BIRC3, FAS, IL1B and NFKBIA having each degree 2.
6.2 eQTL mapping of the p38MAPK pathway
The p38MAPK pathway is activated in vivo by environmental stress and inflammatory cytokines, and plays a key role in the regulation of inflammatory cytokines biosynthesis. Evidence indicates that p38MAPK activity is critical for normal immune and inflammatory response [40, 4, 29]. The pathway also plays an important role in cell differentiation. Its key role in the conversion of myoblasts to differentiated myotubes during myogenic progression has been established by [44, 72, 73]. More recently, in vivo studies demonstrated that p38MAPK signalling is a crucial determinant of myogenic differentiation during early embryonic myotome development . Finally, the pathway is involved in chemotactic cell migration [28, 53]. Lack of p38MAPK function may lead to cell cycle deficiency and tumorigenesis, and genetic variants of some genes in the p38MAPK pathway are associated with lung cancer risk . Studying the pathway in healthy cells may enhance understanding the underlying biological mechanism, but has received less attention.
We investigated the association between single nucleotide polymorphisms (SNPs) and the genes in the P38MAPK pathway, using GEUVADIS data. In the GEUVADIS project , 462 RNA-Seq samples from lymphoblastoid cell lines were obtained, while the genome sequence of the same individuals is provided by the 1000 Genomes Project. The samples in this project come from five populations: CEPH (CEU), Finns (FIN), British (GBR), Toscani (TSI) and Yoruba (YRI). In our analysis we excluded the YRI population samples and samples without expression and genotype data, which resulted in a remaining sample size of 373. We also excluded SNPs with minor allele frequency (MAF) . Using a window of bases upstream and downstream of every gene, we obtained a total number of 42,054 SNPs for the 99 genes of the pathway belonging to the 22 autosomes. This resulted in a system of 99 regression models, with dimensions varying from 56 to 1169. We scaled (per gene) the gene expression data prior to the computations.
Following Section 3
we classified the SNPs connected to each gene as located either within the gene range or outside, and appliedpInc with two groups (). We observed a big difference in the empirical Bayes estimates of the hyperparameters of the priors of : mean value for SNPs outside the gene ranges versus for SNPs inside. The prior information is thus clearly relevant, and hence an improved mapping by pInc can be expected.
We found using Selection procedure 4.3.1 the expression levels of 13 out of the 99 genes (genes 15, 40, 48, 50, 51, 61, 75, 78, 85, 86, 93, 96, 98) to be associated with a total number of 50 SNPs from the 42,054 SNPs under consideration. Gene 50 yielded the highest number 9 of associated SNPs, followed by gene 40 with 6 SNPs and genes 86, 93 and 96 with 5 SNPs each. Figures 4 and 5 display the estimates of the effect sizes of the SNPs (posterior means ), green for SNPs outside the gene ranges and blue for SNPs within a gene, with ‘red stars’ indicating the SNPs that were selected by the selection procedure presented in Section 4.3.1. The 6 largest associations were observed within genes 93, 15, 96, 98 and 78 (red vertical lines in Figures 4 and 5). The active SNPs for all genes, except genes 40 and 50 (although for gene 50 only one of the selected SNPs is not within), are located inside the gene range. This confirms the belief that SNPs falling inside genes are more prone to influence these genes than SNPs outside.
The SNP effects on the remainder 86 () genes are similar to the ones on gene 1 displayed in Figure 5. The selection obtained by using pInc-DSS is similar (see SM).
6.2.1 Comparison of pInc-DSS with lasso
From the many dedicated methods for eQTL analysis [8, 43, 56, 41, 32], we chose the lasso as a bench-mark to compare the model selection by pInc combined with DSS 4.3.2. Our choice for DSS comes from the interest to investigate whether ’pInc
+ lasso’ indeed outperforms a direct lasso, as suggested for the basic horseshoe. As a criterion we used predictive performance when using a sparse model restricted to include a maximal number of predictor variables (SNPs). As for the lasso, the number of selected variables is easy to control bypInc-DSS, because the entire trace of the adaptive lasso (9) is available. To evaluate predictive performance, we used a single 2/3-1/3 split of the data, leading to training and test sets of and observations, respectively. The lasso was computed using GLMnet by , also for (9).
The four panels of Figure 6 report the results for the maximal number of predictor variables set equal to , , , or . The vertical axis shows the relative reduction of the MSE on the test set as compared to the empty model (all ), defined by
where is the MSE of the empty model and the MSE of linear model . This quantity was calculated for all 99 genes in the pathway (horizontal axis), for both the lasso (displayed in black) and pInc-DSS (displayed in red), large values indicating accurate prediction. The results of the lasso are somewhat more ‘noisy’, likely due to less shrinkage of the (near-)zero parameter estimates, and the lasso regularly performs inferior to both the empty model (negative values) and pInc-DSS, with gene 13 an extreme case. For genes with considerable signal w.r.t. the empty model (e.g. genes 61, 93 and 98), pInc-DSS explains much more of the signal than the lasso. This could be explained by less shrinkage of the non-zero parameters by the horseshoe prior, which is designed to separate zero and nonzero values. This is illustrated in Figure 7 for gene 98. Gene 50 is the one exception, where lasso beats pInc-DSS, in the case of selecting 3 variables.
We have introduced a sparse high-dimensional regression approach that can incorporate prior information on the regression parameters and can borrow information across a set of similar datasets. It is based on an empirical Bayesian setup, where external information is incorporated through the prior, and information is borrowed across similar analyses by empirical Bayes estimation of hyperparameters. We have shown the power of the approach both in model-based simulations of Gaussian graphical models and in real data analyses in genomics. Incorporating the information was shown to enhance the analysis, even when the prior information was only partly correct (e.g. 50 % accurate). We explain this by the fact that the empirical Bayesian approach is able to incorporate prior information in a soft manner. Such a flexible approach is particularly attractive in high-dimensional situations where the amount of data is small relative to the number of parameters and an increasing amount of prior information is available.
To make our approach scalable to large models and/or datasets we developed a variational Bayes approximation to the posterior distribution resulting from the horseshoe prior distribution. We showed the accuracy of the resulting approximation to the marginal posterior distributions of the regression parameters by comparison to state-of-the-art MCMC schemes for the horseshoe prior. The variational Bayes approach obtained the same (if not better) accuracy at a fraction of CPU time.
We studied two versions of the model, one with a gamma prior on the ‘sparsity’ parameters and one in which these parameters are estimated by the empirical Bayes method. We found that the gamma prior is preferable when relevant prior knowledge can be used, but in the absence of prior knowledge the alternative model may be preferable.
-  C. Archambeau and F.R. Bach. Sparse probabilistic projections. In Advances in Neural Information Processing Systems 21, pages 73–80. Curran Associates, Inc., 2008.
Variational bridge regression.
Journal of Machine Learning Research, Workshop and Conference Proceedings, 5:17–24, 2009.
Inferring parameters and structure of latent variable models by
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI’99, pages 21–30. Morgan Kaufmann Publishers Inc., 1999.
-  L.S. Berenson and et al. Selective requirement of p38alpha mapk in cytokine-dependent, but not antigen receptor-dependent, th1 responses. J. Immunol., 176:4616–4621, 2006.
-  A. Bhattacharya, A. Chakraborty, and B.K. Mallick. Fast sampling with gaussian scale-mixture priors in high-dimensional regression. Biometrika, 103(4):985–991, 2016.
-  D.M. Blei and M.I. Jordan. Variational inference for dirichlet process mixtures. Bayesian Analysis, 1:121 – 143, 2011.
-  D.M. Blei, A. Kucukelbir, and J.D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
-  X. Cai, A. Huang, and S. Xu. Fast empirical bayesian lasso for multiple quantitative trait locus mapping. BMC Bioinformatics, 12(1):211, 2011.
-  P. Carbonetto and M. Stephens. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis, 7(1):73–108, 2012.
-  C.M. Carvalho, N.G. Polson, and J.G. Scott. Handling sparsity via the horseshoe. Journal of Machine Learning Research, W&CP, 5:73–80, 2009.
-  C.M. Carvalho, N.G. Polson, and J.G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97:465–480, 2010.
-  V.G. Cheung, R.S. Spielman, K.G. Ewens, T.M. Weber, M. Morley, and J.T. Burdick. Mapping determinants of human gene expression by regional and genome-wide association. Nature, 437:1365 – 1369, 2005.
-  J. Datta and J.K. Ghosh. Asymptotic properties of bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111–132, 2013.
-  L. de Angelis and et al. Regulation of vertebrate myotome development by the p38 map kinase-mef2 signaling pathway. Dev. Biol., 283:171–179, 2005.
A. Dobra, C. Hans, B. Jones, J. R. Nevins, G. Yao, and M. West.
Sparse graphical models for exploring gene expression data.
Journal of Multivariate Analysis, 90:196 – 212, 2004.
-  Y. Feng and et al. Novel genetic variants in the p38mapk pathway gene zak and susceptibility to lung cancer. Mol. Carcinog., 57(2):216–224, 2018.
-  J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008.
-  J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1–22, 2010.
-  M. Galassi. GNU Scientific Library : reference manual for GSL version 1.12. Network Theory, Bristol, UK, 3rd edition, 2009.
-  R. Giordano, T. Broderick, and M.I. Jordan. Covariances, robustness, and variational bayes. arXiv preprint arXiv:1709.02536, 2017.
-  P. Giudici and P. J. Green. Decomposable graphical gaussian model determination. Biometrika, 86:785 – 801, 1999.
-  P. Richard Hahn and Carlos M. Carvalho. Decoupling shrinkage and selection in bayesian linear models: A posterior summary perspective. Journal of the American Statistical Association, 110(509):435–448, 2015.
-  J.S. Hamid, P. Hu, N.M. Roslin, V. Ling, C.M.T. Greenwood, and J. Beyene. Data integration in genetics and genomics: Methods and challenges. Human Genomics and Proteomics: HGP, 2009:869093.
-  D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100:57–70, 2000.
-  R.K.S. Hankin. Special functions in r: introducing the gsl package. R News, 6, 2007.
-  Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015.
-  R.D. Hawkins, G.C. Hon, and B. Ren. Next-generation genomics: an integrative approach. Nat. Rev. Genet., 11(7):476–86, 2010.
-  B. Heit, S. Tavener, E. Raharjo, and P. Kubes. An intracellular signaling hierarchy determines direction of migration in opposing chemotactic gradients. J. Cell Biol., 159:91–102, 2002.
-  D. Hwang, B.C. Jang, G. Yu, and M. Boudreau. Expression of mitogen- inducible cyclooxygenase induced by lipopolysaccharide: mediation through both mitogen-activated protein kinase and nf-kappab signaling pathways in macrophages. Biochem. Pharmacol., 54:87–96, 1997.
-  F.H Igney and P.H. Krammer. Death and anti-death: tumour resistance to apoptosis. Nat Rev Cancer., 2:277–88, 2002.
-  B. Jones, C. Carvalho, A. Dobra, C. Hans, C. Carter, and M. West. Experiments in stochastic computation for high dimensional graphical models. Statist. Sci., 20:388 – 400, 2005.
-  S. Kim and E.P. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eqtl mapping. Ann. Appl. Stat., 6(3):1095 – 1117, 2012.
-  G.B. Kpogbezan, A.W. van der Vaart, W.N. van Wieringen, G.G.R. Leday, and M.A. van de Wiel. An empirical bayes approach to network recovery using external knowledge. Biom. Journal, 59(5):932–947, 2017.
-  N. Krämer and et al. Regularized estimation of large-scale gene association networks using graphical gaussian models. BMC Bioinformatics, 10:384, 2009.
-  P. H. Krammer, P. R. Galle, P. Moller, and K. M. Debatin. Cd95(apo-1/fas)-mediated apoptosis in normal and malignant liver, colon, and hematopoietic cells. Adv. Cancer Res., 75:251–273, 1998.
-  M. T. Landi, T. Dracheva, M. Rotunno, J.D. Figueroa, A. Dasgupta H. Liu, F.E. Mann, J. Fukuoka, M. Hames, A.W. Bergen, S.E. Murphy, P. Yang, A.C. Pesatori, D. Consonni, P.A. Bertazzi, S. Wacholder, J.H. Shih, N.E. Caporaso, and J. J. Jen. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS ONE, 3:e1651, 02 2008. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10072.
-  T. Lappalainen and et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501(7468):506–511, 2013.
-  S. Lauritzen. Graphical models. The Clarendon Press, Oxford University Press, New York, 1996.
-  G.G.R. Leday, M.C.M. de Gunst, G.B. Kpogbezan, A.W. van der Vaart, W.N. van Wieringen, and M.A. van de Wiel. Gene network reconstruction using global-local shrinkage priors. Ann. Appl. Stat., 11(1):41 – 68, 2017.
-  J.C. Lee and et al. A protein kinase involved in the regulation of inflammatory cytokine biosynthesis. Nature, 372:739–746, 1994.
-  S.I. Lee, D. Pe’er, A. Dudley, G. Church, and D. Koller. Identifying regulatory mechanisms using individual variation reveals key role for chromatin modification. Proc. Natl. Acad. Sci. USA, 103:14062 – 14067, 2006.
-  B. Lehne, C.M. Lewis, and T. Schlitt. From snps to genes: Disease association at the gene level. PLos ONE, 6(6):e20133, 2011.
-  G. Li, A.A. Shabalin, I. Rusyn, F.A. Wright, and A.B. Nobel. An empirical bayes approach for multiple tissue eqtl analysis. Biostatistics, 19(3):391–406, 2018.
-  Y. Li, B. Jiang, W.Y. Ensign, P.K. Vogt, and J. Han. Myogenic differentiation requires signalling through both phosphatidylinositol 3-kinase and p38 map kinase. Cell. Signal., 12:751–757, 2000.
-  D.J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, New York, NY, USA, 2003.
-  C.A. McGrory and D.M. Titterington. Variational approximations in bayesian model selection for finite mixture distributions. Computational Statistics and Data Analysis, 51:5352 – 5367, 2007.
-  N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34:1436–1462, 2006.
-  A. Mohammadi and E. C. Wit. Bayesian structure learning in sparse gaussian graphical models. Bayesian Anal., 10:109 – 138, 2015.
-  J.T. Ormerod and M.P. Wand. Explaining variational approximations. The American Statistician, 64(2):140–153, 2010.
-  N.G. Polson, J.G Scott, and Windle J. The bayesian bridge. Journal of the Royal Statistical Society: Series B, 76:713 – 733, 2014.
-  M.A. Pujana, J.J. Han, L.M. Starita, K.N. Stevens, M. Tewari, J.S. Ahn, G. Rennert, V. Moreno, T. Kirchhoff, and B. et al. Gold. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nature Genetics, 39:1338 – 1349, 2007.
-  D.M. Reif, B.C. White, and J.H. Moore. Integrated analysis of genetic, genomic and proteomic data. Expert Rev. Proteomics, 1:67–75, 2004.
-  U.R. Ezekiel R.O. Webster R.M. Heuertz, S.M. Tricomi. C-reactive protein inhibits chemotactic peptide-induced p38 mitogen-activated protein kinase activity and human neutrophil movement. J. Biol. Chem., 274:17968–17974, 1999.
-  J. Schäfer, R. Opgen-Rhein, and K. Strimmer. Reverse engineering genetic networks using the GeneNet package. R News, 6:50–53, 2006.
-  N.W.J. Schröder and R.R. Schumann. Single nucleotide polymorphisms of toll-like receptors and susceptibility to infectious disease. Lancet Infect Dis., 5:156 – 64, 2005.
-  E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics, 34:166 – 178, 2003.
-  B.E. Stranger, M.S. Forrest, A.G. Clark, M.J. Minichiello, S. Deutsch, and R. et al. Lyle. Genome-wide associations of gene expression variation in humans. PLoS Genetics, 1:695 – 704, 2005.
-  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
-  R.E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, T. Cemgil, and S. Chiappa, editors, Bayesian Time series models. Cambridge University Press, 2011.
-  M.A. van de Wiel, G.G.R. Leday, L. Pardo, H. Rue, A.W. van der Vaart, and W.N. van Wieringen. Bayesian analysis of rna sequencing data by estimating multiple shrinkage priors. Biostatistics, 14:113 – 128, 2013.
-  Mark van de Wiel, Dennis Te Beest, and Magnus Münch. Learning from a lot: Empirical bayes for high-dimensional model-based prediction. Scandinavian Journal of Statistics, pages 1–24, 2018.
-  S.L. van der Pas, B.J.K. Kleijn, and A.W. van der Vaart. The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, 8(2):2585–2618, 2014.
-  Stéphanie van der Pas, Botond Szabó, and Aad van der Vaart. Adaptive posterior contraction rates for the horseshoe. Electron. J. Stat., 11(2):3196–3225, 2017.
-  Stéphanie van der Pas, Botond Szabó, and Aad van der Vaart. Uncertainty quantification for the horseshoe (with discussion). Bayesian Anal., 12(4):1221–1274, 2017. With a rejoinder by the authors.
-  W.N. van Wieringen and C.F.W. Peeters. Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103:284 – 303, 2016.
-  M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
-  M.P. Wand, J.T. Ormerod, S.A. Padoan, and R. Frühwirth. Mean field variational bayes for elaborate distributions. Bayesian Analysis, 6(4):847–900, 2011.
-  B. Wang and M. Titterington. Inadequacy of interval estimates corresponding to variational bayesian approximations. In Workshop on Artificial Intelligence and Statistics, 2004.
-  Y. Wang and D.M. Blei. Frequentist consistency of variational bayes. arXiv preprint arXiv:1705.03439, 2017.
-  T. Westling and T.H. McCormick. Establishing consistency and improving uncertainty estimates of variational inference through m-estimation. arXiv preprint arXiv:1510.08151, 2015.
-  J. Whittaker. Graphical models in applied multivariate statistics. Wiley, Chichester, 1990.
-  Z. Wu and et al. p38 and extracellular signal-regulated kinases regulate the myogenic program at multiple steps. Mol. Cell. Biol., 20:3951–3964, 2000.
-  A. Zetser, E. Gredinger, and E. Bengal. p38 mitogen-activated protein kinase pathway promotes skeletal muscle differentiation. participation of the mef2c transcription factor. J. Biol. Chem., 274:5193–5200, 1999.
-  B. Zhang and S. Horvath. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol., 4:Art. 17, 45 pp, 2005.
-  T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman. The huge package for high-dimensional undirected graph estimation in r. J. Mach. Learn. Res., 13:1059 – 1062, 2012.
-  J. Zhu, B. Zhang, E.N. Smith, B. Drees, R.B. Brem, L. Kruglyak, R.E. Bumgarner, and E.E. Schadt. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics, 40:854–861, 2008.