A Bayesian Approach to Joint Estimation of Multiple Graphical Models

02/10/2019 · Peyman Jalali, et al.

The problem of joint estimation of multiple graphical models from high dimensional data has been studied in the statistics and machine learning literature, due to its importance in diverse fields including molecular biology, neuroscience and the social sciences. This work develops a Bayesian approach that decomposes the model parameters across the multiple graphical models into shared components across subsets of models and edges, and idiosyncratic ones. Further, it leverages a novel multivariate prior distribution, coupled with a pseudo-likelihood that enables fast computations through a robust and efficient Gibbs sampling scheme. We establish strong posterior consistency for model selection, as well as estimation of model parameters under high dimensional scaling with the number of variables growing exponentially with the sample size. The efficacy of the proposed approach is illustrated on both synthetic and real data.

Keywords: Pseudo-likelihood, Gibbs sampling, posterior consistency, Omics data


1 Introduction

The problem of joint estimation of multiple related Gaussian graphical models has attracted a lot of interest in the statistics and machine learning literature, due to its wide applicability in biomedical studies involving Omics data (e.g., Pierson et al. (2015) and Kling et al. (2015)), as well as in text mining and roll call voting (Guo et al. (2011)). The key idea is to “borrow strength” across the related models and thus enhance the “effective” sample size used for estimating the model parameters, which is achieved primarily through the use of various penalty functions. Specifically, Guo et al. (2011), who first formulated the problem, modeled the elements of each inverse covariance matrix as the product of a component common across all models and an idiosyncratic (model specific) component, and imposed an ℓ1 penalty on each one; thus, when the penalty sets the common component to zero, the corresponding edge is absent across all models, whereas if the common component is not zero, edges can still be absent in selected models because the penalty sets the idiosyncratic component to zero. Another set of approaches aims to achieve a certain amount of “fusing” across all models under consideration, thus focusing both on the presence of common edges and on their absence across all models simultaneously. Examples of such approaches include Danaher et al. (2014), which employed a group lasso and/or a fused lasso penalty on each edge parameter across all models, and Cai et al. (2016), which used a mixed norm for the same task. Variants of the above approaches with modifications to the penalties have also been explored (Zhu et al. (2014); Majumdar and Michailidis (2018)).

However, in many application settings, shared connectivity patterns across models occur only for a subset of edges, while the remaining ones exhibit different connectivity patterns in each model. In other settings, subsets of edges share common connectivity patterns across only a subset of models. In both instances, the previously mentioned approaches will exhibit rather poor performance in discovering these more complex patterns. To address this issue, Ma and Michailidis (2016) proposed a supervised approach based on fusing through a group lasso penalty, wherein the various connectivity patterns across subsets of edges and subsets of models are a priori known. An alternative supervised approach (Saegusa and Shojaie (2016)) employed a similarity graph penalty for fusing across models, coupled with an ℓ1 penalty for obtaining sparse model estimates; the similarity graph is assumed to be a priori known. A Bayesian variant of the latter approach was introduced in Peterson et al. (2015), wherein a Markov random field prior distribution was used to capture model similarity, followed by a spike-and-slab prior distribution on the edge model parameters. Another Bayesian approach was recently developed in Tan et al. (2017) which, similar to Peterson et al. (2015), uses G-Wishart prior distributions on the group-wise precision matrices given the sparsity patterns/networks in each group, and then employs a multiplicative-model-based hierarchical prior distribution on these networks to induce similarity/dependence.

Most of the frequentist approaches reviewed above come with performance guarantees in the form of high probability error bounds for the model parameters, as a function of the number of variables (nodes), the sparsity level and the sample size. Some recent work has focused on constructing confidence intervals for the difference in the magnitude of an edge parameter across two models that are estimated separately using ℓ1 penalization. On the other hand, theoretical guarantees based on high-dimensional posterior consistency results are not available for the Bayesian approaches mentioned above. Also, these approaches can suffer from computational scalability/efficiency issues in moderate/high dimensional settings involving a large number of nodes/variables.

The key objective of this paper is to develop a scalable approach to jointly estimate multiple related Gaussian graphical models that exhibit complex edge connectivity patterns across models for different subsets of edges. To that end, we introduce a novel Subset Specific prior that, for each edge, aims to select the subset of models the edge is common to. We couple it with the Gaussian pseudo-likelihood used in Khare et al. (2015) for estimating a single Gaussian graphical model, which leads to an easy to implement and scalable Gibbs sampling scheme for exploring the posterior distribution. Finally, we establish strong posterior model selection consistency results that can be leveraged for the construction of credible intervals for the edge parameters. Intuitively, the proposed framework achieves the objectives set forth in Ma and Michailidis (2016), without requiring a priori specification of the shared edge connectivity patterns; thus, the approach can be considered as fully unsupervised.

The remainder of the paper is organized as follows. Section 2 formulates the problem, while Section 3 introduces the Subset Specific prior. In Section 4, we combine this prior with a Gaussian pseudo-likelihood to obtain a (pseudo) posterior distribution of the model parameters, and discuss how to sample from it. Section 5 establishes high-dimensional consistency properties for this posterior distribution. Section 6 presents extensive numerical results on synthetic data, assessing the framework’s performance and comparing it with existing approaches in the literature. Section 7 presents an application to metabolomics data from a case-control study on Inflammatory Bowel Disease.

2 Framework for Joint Estimation

Suppose we have data from a number of a priori defined groups. For each group, we observe independent and identically distributed observations drawn from a multivariate normal distribution, with a mean vector and covariance matrix that are specific to that group. Based on the discussion in the introductory section, the precision matrices can share common patterns across subsets of the models, as delineated next. Our goal is to account for these shared structures.

Consider the power set of the group indices and, for each group, define the collection of all subsets that contain it as follows:

(2.1)

It is easy to check that each such collection consists of exactly those subsets which contain the corresponding group, and comprises half of all subsets of the group indices. With every non-empty subset of groups we associate a matrix that contains the patterns common to the precision matrices of its members. Specifically, for a singleton set, this matrix contains a pattern that is unique to the corresponding group, while for a set containing more than a single element, it captures edge connectivity patterns (and their magnitudes) that are common across all members of the set. For example, the matrix associated with the subset comprising the first three groups contains the structures shared by the precision matrices of groups 1, 2 and 3.

Therefore, each precision matrix can be decomposed as

(2.2)

where each component accounts for structures that are either unique to the corresponding group or shared exclusively between that group and some combination of the other groups. We further assume that every group-specific (singleton) component lies in the space of all matrices with positive diagonal entries. Finally, the diagonal entries of every joint matrix associated with a subset of two or more groups are set to zero; in other words, the diagonal entries of each precision matrix are contained in the corresponding group-specific component.

To illustrate the notation, consider the case of three groups. Then, each precision matrix is decomposed as

where the three singleton matrices contain group-specific patterns, the three pairwise matrices contain patterns shared across pairs of models (for subsets of the edges), and the final matrix contains patterns shared across all three models/groups.
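To make the decomposition concrete, it can be written out for this three-group example as follows; the symbols below (Omega_k for the precision matrix of group k and Theta^S for the component shared exactly by the groups in subset S) are our own shorthand and may differ from the paper's notation:

    \Omega_k = \sum_{S \,:\, k \in S} \Theta^{S}, \qquad k = 1, 2, 3,

and in particular

    \Omega_1 = \Theta^{\{1\}} + \Theta^{\{1,2\}} + \Theta^{\{1,3\}} + \Theta^{\{1,2,3\}},
    \Omega_2 = \Theta^{\{2\}} + \Theta^{\{1,2\}} + \Theta^{\{2,3\}} + \Theta^{\{1,2,3\}},
    \Omega_3 = \Theta^{\{3\}} + \Theta^{\{1,3\}} + \Theta^{\{2,3\}} + \Theta^{\{1,2,3\}}.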

2.1 Identifiability Considerations

A moment of reflection shows that the model decomposition (2.2) is not unique. For example, adding an arbitrary matrix to one component and subtracting it from another component appearing in the same decomposition leaves each precision matrix in (2.2) unchanged. Hence, without imposing appropriate identifiability constraints, meaningful estimation of the model parameters is not feasible.

In order to address this issue, we first rewrite the element-wise representation of model (2.2):

(2.3)

where the summands on the right-hand side are the corresponding entries of the component matrices. We only consider the upper off-diagonal entries, due to the symmetry of the precision matrices, and thus for every edge we define a vector that collects its entries across all components, as follows

(2.4)

where each such vector has one distinct parameter per non-empty subset of groups. For identifiability purposes, we require that each of these vectors has at most one non-zero element. Note that under this constraint, if an edge is shared amongst several groups, the non-zero element is allocated to the maximal set of groups sharing it, while all subsets of that set are allocated a zero value; further, the magnitude of the edge is then the same in all groups that share it. As an example, consider again the case of three groups and an edge shared amongst all of them. In this case, the edge is allocated to the component associated with the full set of groups and not to any component associated with a proper subset; hence the entry corresponding to the full set is non-zero, while all remaining entries of the vector are zero.
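Using the shorthand theta^S_{ij} for the (i, j) entry of the component associated with subset S (again our notation, not necessarily the paper's), the constraint in this three-group example reads:

    \theta^{\{1,2,3\}}_{ij} \neq 0, \qquad \theta^{S}_{ij} = 0 \ \text{ for every proper subset } S \subsetneq \{1,2,3\},

so that the edge (i, j) enters all three precision matrices with the common magnitude \theta^{\{1,2,3\}}_{ij}.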

Next, we construct a novel prior distribution that respects the introduced identifiability constraint.

3 Subset Specific Prior Distribution

For any generic symmetric matrix, define two vectors: owing to symmetry, the first collects all of its distinct off-diagonal elements, while the second collects its diagonal entries. In particular, applying the off-diagonal vectorization to a component matrix yields the vectorized version of its off-diagonal elements.

Using the above notation, define the full parameter vector obtained by combining the off-diagonal vectors of all the component matrices. To illustrate, for a small number of groups it is given by

(3.1)

In view of (2.4), it can be easily seen that this vector is a rearrangement of the collection of edge-wise vectors defined there. Thus, according to the locations of its zero coordinates, there is a finite number of possible sparsity patterns across the groups that it can exhibit. Let a generic such sparsity pattern be given and denote the collection of all admissible sparsity patterns accordingly. To illustrate, consider two groups and three variables. In this case, each component matrix has 3 off-diagonal edges. Now, assume edge 12 is shared between the two groups, edge 13 is unique to group 2, and edge 23 is absent in both groups. The full parameter vector then has a non-zero entry only in the component shared by both groups for edge 12 and in the group 2 specific component for edge 13, and its sparsity pattern is read off from the locations of these non-zero entries.
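Written out explicitly with the assumed shorthand theta^S_{ij} (one component per non-empty subset S of the two groups), this example corresponds to:

    \theta^{\{1,2\}}_{12} \neq 0, \quad \theta^{\{1\}}_{12} = \theta^{\{2\}}_{12} = 0; \qquad
    \theta^{\{2\}}_{13} \neq 0, \quad \theta^{\{1\}}_{13} = \theta^{\{1,2\}}_{13} = 0; \qquad
    \theta^{S}_{23} = 0 \ \text{ for all } S,

and the sparsity pattern is the binary vector indicating which of these nine coordinates are non-zero.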

For every sparsity pattern, let its density be the number of non-zero entries it allows, and let the associated subspace be the set in which the parameter vector varies when restricted to follow that pattern. We specify the hierarchical prior distribution as follows:

(3.2)
(3.3)

where the shrinkage matrix is a diagonal matrix whose entries determine the amount of shrinkage imposed on the corresponding elements of the parameter vector, its sub-matrix in (3.2) is obtained after removing the rows and columns corresponding to the zeros of the sparsity pattern, and two edge inclusion probabilities are used, one for sparse patterns and one for dense ones, as determined by a threshold on the pattern density. Later, in our theoretical analysis, we specify values for the two inclusion probabilities and the threshold. Letting the active sub-vector consist of the non-zero coordinates of the parameter vector, the prior in (3.2) corresponds to putting an independent normal prior on each of its entries.

Using the prior distribution posited in (3.2) and (3.3), we derive the marginal prior distribution on the full parameter vector, as follows

(3.4)

where

(3.5)

In other words, the marginal prior can be regarded as a mixture of lower-dimensional multivariate normal distributions, obtained by projecting a full-dimensional multivariate normal distribution onto the union of the subspaces associated with all admissible sparsity patterns. Note that the prior induces sparsity on the parameter vector, which is helpful for model selection purposes. Further, the prior respects the identifiability constraint by forcing, for each edge, all but at most one of its components to be exactly equal to zero. In addition to inducing sparsity, the diagonal entries of the shrinkage matrix impose shrinkage on the corresponding elements of the parameter vector. We discuss the selection of these shrinkage parameters in Section 4.2.
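To convey the structure of the Subset Specific prior, the following Python sketch draws the components of a single edge from a simplified version of it: with some probability the edge is absent from every group; otherwise exactly one subset component is active and receives an independent normal value. The inclusion probability q, the shrinkage precision lam and the uniform choice of the active subset are simplifying assumptions of this sketch; the paper's prior uses two inclusion probabilities separated by a density threshold and a diagonal matrix of shrinkage parameters.

import itertools
import numpy as np

def draw_edge_components(K, q=0.1, lam=1.0, rng=None):
    """Draw the subset-specific components of one edge from a simplified
    Subset Specific prior: at most one component is non-zero (identifiability),
    and the non-zero component is Gaussian with precision `lam`.
    `q` is a hypothetical inclusion probability; the paper uses two
    probabilities (sparse/dense regimes) rather than a single one."""
    rng = np.random.default_rng(rng)
    # One component per non-empty subset of {1, ..., K}.
    subsets = [s for r in range(1, K + 1)
               for s in itertools.combinations(range(1, K + 1), r)]
    theta = {s: 0.0 for s in subsets}
    # With probability 1 - q the edge is absent from every group.
    if rng.random() < q:
        active = subsets[rng.integers(len(subsets))]   # which subset shares the edge
        theta[active] = rng.normal(scale=1.0 / np.sqrt(lam))
    return theta

# Example: one draw for K = 3 groups.
print(draw_edge_components(3, rng=0))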

Note that the vector defined above only incorporates the off-diagonal entries of the component matrices. Regarding the diagonal entries, for every group we collect the diagonal elements of the corresponding group-specific matrix into a vector, and define the overall diagonal vector by stacking these group-wise vectors, i.e.

(3.6)

We assign an independent Exponential prior to each coordinate of this vector (i.e., to each diagonal element of the group-specific matrices), namely

(3.7)
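For a generic diagonal entry delta > 0 and rate hyperparameter lambda (symbols assumed here), this is the standard exponential density, applied independently across all diagonal entries:

    \pi(\delta) = \lambda\, e^{-\lambda \delta}, \qquad \delta > 0.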

The selection of the corresponding rate hyperparameter is also discussed in Section 4.2. Since the diagonal entries of every joint matrix associated with two or more groups are set to zero, the specification of the prior is now complete.

4 The Bayesian Joint Network Selector (BJNS)

Estimation of the model parameters is based on a pseudo-likelihood approach. The pseudo-likelihood, which is based on the regression interpretation of the entries of the precision matrices, can be regarded as a weight function; as long as the product of the pseudo-likelihood and the prior density is integrable over the parameter space, one can construct a (pseudo) posterior distribution and carry out Bayesian inference. The main advantage of using a pseudo-likelihood, as opposed to a full Gaussian likelihood, is that it allows for an easy to implement sampling scheme for the posterior distribution and, in addition, provides more robust results under deviations from the Gaussian assumption, as illustrated by work in the frequentist domain (Khare et al. (2015); Peng et al. (2009)). Note that the pseudo-likelihood approach does not respect the positive definiteness constraint on the precision matrices under consideration; however, since our primary goal is estimating the skeleton of the underlying graphs, this mitigates the issue. Further, accurate estimation of the magnitudes of the selected edges can be accomplished through a refitting step for the model parameters restricted to the skeleton, as shown in Ma and Michailidis (2016). We also establish high-dimensional sparsity selection and estimation consistency for our procedure in Section 5.

For each group, let the sample covariance matrix be computed from its observations. Based on the above discussion, we employ the CONCORD pseudo-likelihood introduced in Khare et al. (2015), together with the model specification (2.2), to construct the joint pseudo-likelihood function for the precision matrices, as follows:

(4.1)
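For reference, in the single-matrix case the CONCORD pseudo-likelihood of Khare et al. (2015) takes, up to constants, the form sketched below, where Omega denotes a precision matrix with entries omega_{ij}, S the sample covariance matrix, n the sample size and p the number of variables (our notation); the joint pseudo-likelihood in (4.1) is then built by multiplying such terms across the groups, with each precision matrix replaced by its decomposition (2.2):

    L_{\mathrm{con}}(\Omega) \;\propto\; \exp\Big\{\, n \sum_{i=1}^{p} \log \omega_{ii} \;-\; \frac{n}{2}\, \mathrm{tr}\big(\Omega\, S\, \Omega\big) \Big\}.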

Since we have parametrized the prior in terms of the off-diagonal and diagonal vectors, we rewrite the above pseudo-likelihood function in terms of them as well. Some straightforward algebra shows that

(4.2)

where the first quantity is a symmetric matrix whose entries are either zero or linear combinations of the sample covariance entries; the second is a diagonal matrix; the third is a vector whose entries depend on the sample covariances and on the diagonal parameters; and the last is a matrix linking the two parametrizations. For the algebraic details of the equality in (4.2) and the structure of these quantities, we refer the reader to Supplemental section S1.

Now, collecting the off-diagonal and diagonal parameters together and applying Bayes’ rule, their joint posterior distribution is given by

(4.3)

Moreover, the conditional posterior distribution of the off-diagonal parameters given the diagonal ones is given by

(4.4)

while that of the diagonal parameters given the off-diagonal ones is given by

(4.5)

where , for and .

Next, we discuss a Gibbs sampling algorithm to generate approximate samples from the above posited posterior distribution.

4.1 Gibbs Sampling Scheme for BJNS

Generating exact samples from the multivariate distribution in (4.3) is not computationally feasible. We instead generate approximate samples from (4.3) by deriving the full conditional distribution of each edge-wise vector and of each diagonal entry.

As discussed earlier, each edge-wise vector contains one element per non-empty subset of groups, of which at most one is non-zero. For ease of exposition, index its elements by these subsets (based on (2.4), every element corresponds to an entry of one component matrix), and use the same indexing for the corresponding shrinkage parameters (the diagonal elements of the shrinkage matrix). Since the location of the single possibly non-zero entry can vary, each edge-wise vector can be viewed as an element of one of a finite collection of disjoint spaces: the singleton set consisting of the zero vector, or one of the one-dimensional spaces spanned by the coordinate unit vectors.

Consider the sub-vector of the full parameter vector obtained by removing the current edge-wise vector, together with the sub-matrices of the quadratic-form matrix and the sub-vector of the linear term obtained by removing the rows and columns that correspond to zero elements. Then, the conditional posterior distribution of the current edge-wise vector, given the remaining parameters and the data, is given by

Note that since the edge-wise vector has at most one non-zero element, all cross products in the corresponding quadratic form are equal to zero, and only the associated diagonal entry of the quadratic-form matrix enters. Hence, denoting the univariate normal probability density function in the usual way, we get

with

(4.6)

for each candidate component. Hence, we can write

(4.7)

The above density is a mixture of univariate normal densities and the cost of generating samples from this density is comparable to that of generating from a univariate normal distribution.
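The following Python sketch shows how sampling from such a mixture can be carried out: given the (unnormalized) weight of the all-zero configuration and the weights, means and variances of the normal components in (4.6)-(4.7), one first picks a configuration and then, if it is not the all-zero one, draws a single univariate normal value. All variable and function names are hypothetical; the actual weights and moments come from (4.6)-(4.7).

import numpy as np

def sample_edge_vector(w_zero, w_comp, means, variances, rng=None):
    """Draw one edge-wise vector from a mixture of a point mass at zero and
    univariate normal components.

    w_zero    : unnormalized weight of the all-zero configuration
    w_comp[m] : unnormalized weight of the configuration where only
                component m is non-zero
    means[m], variances[m] : moments of the corresponding normal density
    Returns a vector with at most one non-zero entry.
    """
    rng = np.random.default_rng(rng)
    weights = np.concatenate(([w_zero], w_comp))
    probs = weights / weights.sum()
    choice = rng.choice(len(probs), p=probs)
    theta = np.zeros(len(w_comp))
    if choice > 0:                      # a non-zero component was selected
        m = choice - 1
        theta[m] = rng.normal(means[m], np.sqrt(variances[m]))
    return theta

# Toy usage with made-up weights/moments for an edge with 3 components (K = 2 groups).
theta = sample_edge_vector(w_zero=2.0,
                           w_comp=np.array([1.0, 0.5, 0.25]),
                           means=np.array([0.4, -0.2, 0.1]),
                           variances=np.array([0.05, 0.05, 0.05]),
                           rng=1)
print(theta)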

In view of (4.5), one can also easily see that, conditional on the off-diagonal parameters, the diagonal entries are a posteriori independent, and their conditional posterior density is as follows:

(4.8)

where , for and .

Note that the density in (4.8) is not a standard one; however, an efficient algorithm to sample from it is provided in Supplemental section S2.

4.2 Selection of Shrinkage Parameters

Consider a generic off-diagonal element and a generic diagonal element, together with their corresponding shrinkage parameters. Selecting appropriate values for the latter is an important task. In other Bayesian analyses of high dimensional models, shrinkage parameters are usually generated from an appropriate prior distribution; see Park and Casella (2008), Kyung et al. (2010) and Hans (2009) for regression analysis, and Wang et al. (2012) for graphical models. We assign independent gamma priors to each shrinkage parameter, with common shape and rate hyper-parameters. The amount of shrinkage imposed on each element can then be calculated by considering the posterior distribution of the shrinkage parameters given the current parameter values. Straightforward algebra shows that

Thus, the shrinkage applied to each off-diagonal and diagonal element is determined, on average, by the posterior mean of its shrinkage parameter. That is, our approach selects the shrinkage parameters adaptively, based on the current values of the corresponding entries, so that the amount of regularization applied to each entry depends on its current magnitude. The selection of the gamma hyper-parameters is also an important task and can significantly affect performance; based on numerical evidence from synthetic data, we use fixed values for them, and a similar suggestion is made in Wang et al. (2012).
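One standard version of this calculation, stated as a sketch under the assumption that a generic entry theta has a N(0, lambda^{-1}) prior and lambda ~ Gamma(a, b): conjugacy then gives

    \lambda \mid \theta \;\sim\; \mathrm{Gamma}\!\Big(a + \tfrac{1}{2},\; b + \tfrac{\theta^2}{2}\Big),
    \qquad
    \mathbb{E}\big[\lambda \mid \theta\big] \;=\; \frac{a + 1/2}{\,b + \theta^2/2\,},

and analogously for the diagonal entries; the paper's exact parametrization and the specific hyper-parameter values are not reproduced here.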

Finally, the construction of the Gibbs sampler proceeds as follows: the group-specific component matrices are initialized at the identity matrix, while the joint components are initialized at zero. Then, in each iteration of the Markov chain Monte Carlo scheme, we update the edge-wise vectors in (2.4) and the diagonal entries, one at a time, using their full conditional posterior densities given in (4.7) and (4.8), respectively. Algorithm 1 describes one iteration of the resulting Gibbs sampler.

procedure BJNS (input: data summaries and current values of all parameters)
    for each edge, i.e., each upper off-diagonal position (i, j) do
        for each non-empty subset of groups do
            compute the conditional mean and variance of the corresponding component of the edge-wise vector, as in (4.6)
            compute the (unnormalized) weight of the configuration in which only that component is non-zero, as in (4.7)
        end for
        compute the weight of the all-zero configuration and normalize all weights
        sample the active configuration of the edge-wise vector from the resulting discrete distribution
        if a non-zero component was selected then
            draw its value from the corresponding univariate normal density
        end if
        set all remaining components of the edge-wise vector to zero
    end for
    for each group and each variable do
        update the corresponding diagonal entry from its conditional density (4.8), using Algorithm 1 in Supplemental section S2
    end for
    return the updated set of parameters
end procedure
Algorithm 1 Gibbs Sampler for BJNS
Remark 1.

Note that although BJNS is a completely unsupervised approach, available prior knowledge on shared patterns across the groups can be easily incorporated by removing redundant components from the decomposition. Further, prior information can be incorporated through an appropriate specification of the edge selection probabilities.

Remark 2.

The Gibbs sampler described in Algorithm 1 does not involve any matrix inversion, which is critical for its computational efficiency.

4.3 Procedure for Sparsity Selection

Note that the conditional posterior probability density of the off-diagonal elements of each edge-wise vector is a mixture density that puts all of its mass on configurations with at most one non-zero coordinate. This property of BJNS allows for model selection, in the sense that in every iteration of the Gibbs sampler one can record whether the vector is entirely zero and, if not, which of its elements (there can be at most one) is non-zero. Finally, at the end of the procedure, we choose the configuration with the highest frequency during sampling. In addition, credible intervals can be constructed from the empirical quantiles of the sampled values corresponding to the selected configuration.
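To make the selection rule concrete, the following Python sketch post-processes the Gibbs output for a single edge: each MCMC draw of the edge-wise vector is reduced to a configuration label (which component, if any, is non-zero), the most frequent configuration is selected, and a credible interval is formed from the empirical quantiles of the draws under that configuration. Function and variable names are hypothetical.

from collections import Counter
import numpy as np

def summarize_edge(draws, level=0.95):
    """draws: (n_samples, M) array of Gibbs draws of one edge-wise vector,
    each row having at most one non-zero entry (M = number of subset components).
    Returns the selected configuration (None for 'edge absent', otherwise the
    component index) and a credible interval for its value."""
    labels = []
    for row in draws:
        nz = np.flatnonzero(row)
        labels.append(int(nz[0]) if nz.size else None)   # which component is active
    selected, _ = Counter(labels).most_common(1)[0]      # majority vote
    if selected is None:
        return None, (0.0, 0.0)
    vals = draws[[lab == selected for lab in labels], selected]
    lo, hi = np.quantile(vals, [(1 - level) / 2, 1 - (1 - level) / 2])
    return selected, (lo, hi)

# Toy usage: 5 draws, 3 components (K = 2 groups); component 0 is selected.
draws = np.array([[0.5, 0.0, 0.0],
                  [0.4, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.6, 0.0, 0.0],
                  [0.0, 0.3, 0.0]])
print(summarize_edge(draws))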

5 High Dimensional Sparsity Selection Consistency and Convergence Rates

Next, we establish theoretical properties of BJNS. Let the true values of the component matrices in (2.2) be given, so that they correspond to the true decomposition of each precision matrix. Using the same ordering as in (3.1), we define the true parameter vector as the vectorized version of the off-diagonal elements of these true matrices.

The following assumptions are made to obtain the results.

Assumption 1.

(Accurate diagonal estimates) There exist estimates of the diagonal entries such that, for any given constant, there exists a corresponding constant such that

(5.1)

with probability at least .

Note that our main objective is to accurately estimate the off-diagonal entries of all the matrices present in model (2.2). Hence, as is commonly done in pseudo-likelihood based high-dimensional consistency proofs (see Khare et al. (2015); Peng et al. (2009)), we assume the existence of accurate estimates of the diagonal elements through Assumption 1. One way to obtain such accurate estimates of the diagonal entries is discussed in Lemma 4 of Khare et al. (2015). We then consider the accuracy of the estimates of the off-diagonal entries obtained after running the BJNS procedure with the diagonal entries fixed at these estimates.

Assumption 2.

This assumption essentially states that the number of variables has to grow slower than an exponential function of the sample size. Similar assumptions have been made for other high dimensional covariance estimation methods; see, e.g., Banerjee et al. (2014), Banerjee and Ghosal (2015), Bickel and Levina (2008), and Xiang et al. (2015).

Assumption 3.

There exists a constant, independent of the sample size and the dimension, such that

While building our method, we assumed that the data in each group comes from a multivariate normal distribution. The above assumption allows for deviations from Gaussianity. Hence, Theorem 1 below will show that the BJNS procedure is robust (in terms of consistency) under misspecification of the data generating distribution, as long as its tails are sub-Gaussian.

Assumption 4.

(Bounded eigenvalues) There exists a constant, independent of the sample size and the dimension, such that for all groups the eigenvalues of the true covariance matrices are uniformly bounded away from zero and infinity.

This is a standard assumption in high dimensional analysis for obtaining consistency results; see, for example, Bühlmann and Van De Geer (2011).

Henceforth, we let , , , .

Assumption 5.

(Signal strength) The smallest non-zero entry (in magnitude) of the true parameter vector is assumed to be bounded below at an appropriate rate.

This is again a standard assumption. Similar assumptions on the appropriate signal size can be found in Khare et al. (2015); Peng et al. (2009).

Assumption 6.

(Decay rate of the edge probabilities) The edge inclusion probabilities for the sparse and the dense case are assumed to decay to zero at appropriate rates governed by a constant.

This can be interpreted as a priori penalizing matrices with too many non-zero entries, with a faster decay rate required in the case of super dense matrices. Similar assumptions are common in the literature; see, for example, Narisetty et al. (2014) and Cao et al. (2016).

We now establish the main posterior consistency result. In particular, we show that the posterior mass assigned to the true model converges to one in probability (under the true model).

Theorem 1.

(Strong Selection Consistency) Based on the joint posterior distribution given in (4.3), and under Assumptions 1-6, the following holds,

(5.2)
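Since the display did not survive extraction, the form of the statement can be sketched as follows, writing nu_0 for the true sparsity pattern and Pi(. | data) for the (pseudo) posterior in (4.3) (notation assumed): the theorem asserts that, under the true data generating distribution,

    \Pi\big(\nu = \nu_0 \,\big|\, \mathrm{data}\big) \;\longrightarrow\; 1 \quad \text{in probability, as the sample size grows.}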

Our next result establishes estimation consistency of the BJNS procedure for the off-diagonal parameters and also provides a corresponding rate of convergence.

Theorem 2.

(Estimation Consistency Rate) Let the maximum entry (in magnitude) of the true parameter vector be given, and assume that it cannot grow faster than a prescribed rate. Then, based on the joint posterior distribution given in (4.3), and under Assumptions 1-6, there exists a large enough constant (not depending on the sample size) such that the following holds:

(5.3)

The proofs of the above results are provided in the Supplement, section S3.

6 Simulation Studies

In this section, we present three simulation studies to evaluate the performance of BJNS. In the first study, we illustrate the performance of BJNS in two scenarios, each involving four precision matrices. In the second study, we compare BJNS with other methodologies: Glasso, where the graphical lasso of Friedman et al. (2008) is applied to each graphical model separately; the joint estimation method of Guo et al. (2011), denoted by JEM-G; the Group Graphical Lasso of Danaher et al. (2014), denoted by GGL; and the Joint Structural Estimation Method of Ma and Michailidis (2016), denoted by JSEM. In the third study, we demonstrate the numerical scalability of BJNS when the number of precision matrices is relatively large.

Throughout, for any given set of precision matrices, we generate the data in each group from a multivariate normal distribution whose precision matrix is the corresponding group-specific matrix.
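As a sketch of this sampling mechanism, with made-up matrices and hypothetical function names, the data generation step can be written in Python as:

import numpy as np

def generate_group_data(precisions, n_per_group, rng=None):
    """Generate synthetic data for each group: n_per_group i.i.d. p-variate
    normal observations with mean zero and covariance equal to the inverse
    of the group's precision matrix. A sketch of the simulation setup."""
    rng = np.random.default_rng(rng)
    data = []
    for omega in precisions:
        cov = np.linalg.inv(omega)          # covariance = inverse precision
        data.append(rng.multivariate_normal(np.zeros(omega.shape[0]), cov,
                                            size=n_per_group))
    return data

# Toy usage: two 3x3 precision matrices, 50 observations each.
omega = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])
samples = generate_group_data([omega, np.eye(3)], n_per_group=50, rng=0)
print(samples[0].shape, samples[1].shape)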

We assess model performance using three well known accuracy measures, specificity (SP), sensitivity (SE) and the Matthews Correlation Coefficient (MCC), defined as:

(6.1)

where TP, TN, FP and FN represent the number of true positives, true negatives, false positives and false negatives, respectively. A larger value of any of the above metrics indicates better sparsity selection by the underlying method.
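The standard definitions of these measures, which the display in (6.1) presumably contains, are:

    \mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}, \qquad
    \mathrm{SE} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad
    \mathrm{MCC} = \frac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}.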

Recall that BJNS estimates each edge-wise vector so that at most a single coordinate is non-zero; i.e., regardless of the number of networks, BJNS estimates at most one off-diagonal parameter per edge and forces all remaining components of that edge to be zero. The final decision on the sparsity of each vector in (2.4) is made by majority voting across the MCMC samples. For every experiment, we base our inference on 2000 samples generated from the MCMC chain after discarding 2000 burn-in samples. We ensure the robustness of the presented results by repeating each experiment over 100 replicate data sets and averaging the accuracy measures across the replicates.

Finally, the implementation of BJNS using the Rcpp and RcppArmadillo libraries (Eddelbuettel and Sanderson (2014); Eddelbuettel and François (2011)), together with the avoidance of matrix inversion operations discussed previously, contributes to its computational efficiency.

6.1 Simulation 1: Four groups

We consider two challenging scenarios to examine the performance of BJNS in the case of simultaneously estimating four inverse covariance matrices.

  1. Four precision matrices with different degrees of shared structure. We let the first matrix follow an AR(2) model, with prescribed non-zero values on the main diagonal and on the first and second off-diagonals; and