1 Introduction and Related Work
Estimating a network of dependencies among a set of objects is a difficult problem in statistics, particularly in high-dimensional settings or where the observed measurements are noisy. Gaussian Graphical Models (GGM) are a tool for representing such relationships in an interpretable way. In a classical GGM setting, the sparsity pattern of the inverse covariance matrix encodes conditional independence between variables of the graph. Consequently, various estimators have been proposed that reduce the number of parameters by imposing sparsity constraints on . Among these, the popular graphical lasso procedure [9, 18] places a Wishart likelihood on the sample covariance and computes a point estimate of the graph by minimizing the penalised log-likelihood.
In situations where the variables in the network have different types, it is often more interesting to examine the connections between these types as opposed to estimating an entire network of all the associations. Consider the example in gene analysis where the dependency between only a few clinical factors and thousands of genetic markers is required. When we would like to focus on a particular portion of the network, it is useful to limit the estimate to the Markov blanket of the nodes we are interested in. These are the set of nodes that, when conditioned on, render the nodes of interest independent of the rest of the network.
In this paper we provide a Bayesian perspective of estimating the Markov blanket of a set of query variables in an undirected network111Also known as a Markov network or a Markov random field.. Unlike the point estimate of the graphical lasso, the Bayesian view enables the computation of a posterior distribution of the Markov blanket. A Bayesian interpretation of the graphical lasso is presented by Wang . This approach partitions the matrix as shown on the left in Fig. 1. Posterior inference involves iterating through the individual variables to estimate the entire network. In particular, inference of the block relies on estimating both and . However, the coupling of and is a limiting factor that can be avoided in the context of Markov blanket estimation. This idea forms the basis of our paper. An important observation for the model we present here, is that the Wishart likelihood may be factorised such that the blocks and are de-coupled from . This result is provided as Lemma 1 in Section 2. We show that by combining the factorised likelihood with an appropriate choice of prior, we obtain a posterior distribution that preserves this independence structure. Most importantly, this posterior distribution has an analytic form and can hence be sampled from. We formalise this in Section 3 as Theorem 1. A further consequence of this result is Theorem 2 which demonstrates that sampling from the posterior distribution can be done efficiently. Overall, this means that the Markov blanket of query nodes, can be estimated efficiently without explicitly inferring the entire network.
An overview of our approach is presented in Fig. 1, where the matrix is partitioned similarly to . More precisely, we consider the case where . The difference in the shading of the blocks in indicates that estimation of and (and hence ) is invariant of estimating . The core results described in the previous paragraph are summarised.
The remainder of this paper is structured as follows. We begin by exploring the block factorization of the Wishart likelihood in Section 2. We subsequently derive the posterior distribution and construct a Gibbs sampler to efficiently sample from the different blocks in Section 3. Section 4 describes how Bayesian Markov blanket estimation can be extended to deal with mixed data types with the copula framework. Finally, we demonstrate the practical applicability of the scheme in Section 5 with examples of artificial and real data.
2 Likelihood Model
Assume is a given matrix with independent observations. We are interested in estimating the Markov blanket of query variables with respect to the remaining variables in the data matrix. The sample covariance follows the Wishart distribution with degrees of freedom. That is, . Assume that and are partitioned according to
where the matrices have been reordered such that the query variables lie in the upper left block. Given , we would like to infer , the Markov blanket of the variables that constitute the block . We restrict the problem to the case where such that is small, corresponding to the few variables of interest, and is large.
We begin by showing a blockwise independence structure of the likelihood, which builds the foundation of our model.
Let be the Schur complement of the block . Then (, where denotes conditional independence. That is,
The proof of this lemma can be found in the supplementary document, and is analogous to Gupta and Nagar [Chapter 3, pp. 94–95]. A consequence of the conditional independence of the blocks and in Lemma 1 is that we can derive the posterior distribution of the and blocks, and subsequently sample from them to deduce the Markov blanket. We discuss this in detail in the next section.
3 Posterior Inference
In this section we define the prior of our model and note that it also possesses the fundamental blockwise independence structure proved in Lemma 1 for the likelihood. The posterior distribution of and is therefore independent of . We derive this posterior distribution and show that it has an analytical form, which is formulated as Theorem 1. We subsequently demonstrate how to efficiently sample from it in Section 3.2.
3.1 The Posterior Distribution
The natural conjugate prior to our likelihood is the Wishart distribution. However, in order to ensure sparsity, we also use a double exponential prior as in Wang. Since our focus is on the Markov blanket, we only place the latter on the block . This results in a compound prior:
are inverse-Gaussian distributed scale parameters introduced by Wang:
is a hyperparameter. Most importantly, the compound prior does not affect the independence structure proved for the likelihood in Lemma1. Multiplying the prior introduced in Eq. (3) by the likelihood yields the posterior distributions for blocks and .
We now state the main result of this paper. Specifically, we show that the posterior distribution required to estimate the Markov blanket can be expressed in an analytical form.
Let the Matrix Generalised Inverse Gaussian (MGIG) distribution 
be defined by probability density function with parameter:
The posterior conditionals and admit an analytical form:
Vectorised rows of
follow a joint normal distribution
where be the covariance matrix, and be diagonal matrices containing .
follows the Matrix Generalised Inverse Gaussian (MGIG) distribution:
The posterior maintains the conditional independence structure proved for the likelihood in Lemma 1, i.e. , because the compound prior defined in Eq. (3) is element-wise independent. Derivations of the distributions in Eqs. (6) and (7) follow from factorising the posterior and rearranging terms. Relevant calculations are provided in the supplementary document. ∎
Theorem 1 shows that estimation of the Markov blanket of the query variables only requires sampling from the posterior conditionals of and , which both have an analytical form while remaining independent of . Therefore, the amount of parameters in the Markov blanket that need to be estimated, scales linearly with . This is an improvement over the Bayesian graphical lasso  approach, where this number grows quadratically with . Theorem 1 also provides us with the particular distributions to sample from. Next, we demonstrate how this sampling can be done efficiently.
3.2 Efficiency of Sampling from the Posterior
The blockwise Gibbs sampling scheme for estimating the Markov blanket is summarised in Algorithm 1. This sampling scheme consists of iterative resampling of and of , according to their definitions in Theorem 1.
The distribution of is given by Theorem 1. The vectorised rows of follow a joint normal distribution. For , the distribution further simplifies to
The majority of the computational cost incurred in our method arises from sampling from this joint normal distribution. Eq. (8) requires us to invert , which is of size . Note that
cannot be represented as a covariance tensor of a matrix normal distribution. Therefore, naïve inversion ofusing a standard Cholesky decomposition would cost operations. Our efficient sampling strategy exploits the structure of this matrix, which is the foundation of Theorem 2.
Sampling from the distribution in Theorem 1 requires operations.
We expand the Kronecker product of matrix , which comprises blocks of size :
where is the inverted upper diagonal block. We observe a regular structure within the blocks in : Matrices are added to the diagonals blocks only, and the non-diagonal blocks only differ by scalar factors . With a blockwise Cholesky factorisation, the inversion requires only operations. Since the Cholesky decomposition of the blocks also only differs by a factor, we can store its intermediate result. ∎
. It introduces a representation of an MGIG-distributed random variable as a limit of a random continued fraction of Wishart-distributed random variables. The interested reader should refer to[16, 1, 14] for the details. Drawing samples from the MGIG thus reduces to iterated sampling from the Wishart distribution. In practice, we observe the convergence of the random continued fraction within few iterations. The complexity of sampling from the distribution derived in Theorem 2 does not depend on .
4 Extension to Gaussian Copula
We extend the model for non-Gaussian and mixed continuous/discrete data by embedding it within a copula construction. Copulas describe the dependency in a
-dimensional joint distribution
and represent an invariance class with respect to the marginal cumulative distribution functions (cdf). In our model, . For continuous cdfs, Sklar’s theorem  guarantees the existence and uniqueness of a copula , such that . For discrete cdfs, this leads to an identifiability problem , such that established methods on empirical marginals  cannot be used anymore, but a valid copula can still be constructed . For our purpose, we follow the semi-parametric approach by  and restrict our model to the parametric Gaussian copula, but we do not restrict the data to be Gaussian and treat them in a non-parametric fashion. The Gaussian copula inherently implies latent variables . Our model under consideration is
where denotes the th generalised inverse of continuous or discrete cdfs, are the latent variables, and are the observations.
Following , inference in the latent variables uses the non-decreasing property of discrete cdfs for transforming the observed variables to the latent space. This guarantees that for observations we also have , and more generally, must lie in the set
The data likelihood can then be written as
and estimation of is performed on maximising the sufficient statistics only, thus treating the marginals
as nuisance parameters. Bayesian inference foris achieved by a Markov chain having stationary distribution at the posterior , where a inverse-Wishart prior is used. Posterior inference can be achieved with a Gibbs sampler, which draws alternately between and . This sampler extends Alg. 1 with an additional outer loop for inferring the latent variables. The Markov blanket is then iteratively estimated on these variables. The sampling scheme easily accommodates for missing values, when omitting conditioning on the set .
The presented framework is very useful in practice, since the invariance class of copulas extend the model to non-Gaussian data. With the additional stochastic transformation to the latent space, we can use discrete variables and allow missing values. In real world applications, it becomes apparent that this is a very valuable extension.
5.1 Artificial Data
As a first experiment, we attempt to highlight the differences in inference between the Bayesian Markov blanket (BMB) and Bayesian Graphical Lasso (BGL) procedures. We construct an artificial network with variables, where our interest is confined to only the Markov blanket between query variables and the remaining variables. In order to create networks with a “small-world” flavour containing hubs, i.e. nodes with very high degree, the connectivity structure of the inverse covariance matrix is generated by a beta-binomial model. Edge weights are sampled uniformly from the interval
, and edge signs are randomly flipped. Finally, positive definiteness is guaranteed by adding a suitable constant (related to the smallest eigenvalue) to the diagonal. This process produces a sparse network structure where the majority of edges are connected to only a few single nodes. Note that many real-world networks exhibit such small-world properties.
Next, we draw independent samples from a zero-mean normal distribution with covariance matrix and compute the sample covariance . Fig. 2 depicts a true Markov blanket and its reconstruction by BGL and BMB using the same sparsity parameter . Both methods were run side-by-side for MCMC samples after an initial burn-in phase of samples. From the sampled networks, a representative network structure is constructed by thresholding based on a credibility interval. We repeat the above procedure to obtain a total of datasets. The quality of reconstructed networks is measured in terms of
-score (harmonic mean of precision and recall) between the true and inferred Markov blanket. When computing precision and recall, inferred edges with edge weights having the wrong sign are counted as missing. Both models share the same sparsity parameter, which in this experiment was selected such that for BMB recall and precision have roughly the same value. The results are depicted as box plots in Fig. 3, from which we conclude that there are indeed substantial differences in both models. In particular, BGL has the tendency to introduce many unnecessary edges in comparison to BMB. As a result, BGL achieves high recall and low -score. Since both methods are based on the same likelihood model and (almost) the same prior, the observed differences can only be attributed to differences in the inference procedure: BGL infers a network by iterating over all variables and their neighbourhood systems, whereas BMB only estimates the elements in and .
To further study the influence of the different Gibbs sampling strategies, we examine tracer plots and auto-correlations of individual variables in Fig. 4
. In almost all cases, BGL shows significantly higher auto-correlation and poor convergence. In contrast, Markov chains in the BMB sampler seems to mix much better, typically leading to posteriors with smaller bias and variance. While only one example is shown in the figure, similar results can be seen for basically all variables in the network. Further, we experience a substantial decrease in run-time, even for these relatively small networks: computingMCMC samples for BMB finished on average after seconds, while BGL typically consumed around seconds. Since BGL requires an additional sampling loop over all variables, datasets with large quickly become problematic for BGL. We further explore these differences in the next section for a large real-world application.
5.2 Real Data
To demonstrate the practical significance of Markov blanket estimation, we turn to the analysis of colorectal cancer, which in 2012 ranked among the three most common types of cancer globally . The data set introduced in  is publicly available and contains gene expression measurements from biopsies of cancer patients. A separate table captures discrete/categorical clinical traits such as sex, age or pathological staging/grading. In this context, one particularly interesting research question is to identify connections between the (macroscopic) clinical descriptors and the (molecular) gene expression measurements.
Among the genes contained in the dataset, we focus on a specific subset, the so-called “Pathways in cancer” as defined in the KEGG database222Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/pathway.html. This particular subset comprises a general class of genes which are known to be involved in various biological processes linked to cancer. For this experiment, we have candidate genes and query variables. These are the age and sex of the patient as well as the TNM classification, cancer group stage (GS) and mutation of the tumor suppressor protein p53. Since the observations have mixed continuous/discrete data types with missing values, the Markov blanket estimation is extended by a semi-parametric Gaussian copula framework . Based on this, we calculate MCMC samples, which finally leads to the Markov blanket in Fig. 5.
The resulting network structure confirms some well-known properties like the confounding effect of the age and sex variables, both of which (correctly) link to a large number of genes. For example, FGF21 exhibits significant differences in male and female subjects , and CTNNA1 shares connections to survival time in men . Similarly, mTOR, the mechanistic target of rapamycin, not only represents a key element for cell signaling that triggers a cascade of immune-related pathways, but its function also depends heavily on a subject’s age . Further, we are able to identify a very interesting network structure around the variable tumor size T: almost all direct neighbors control either cell growth (EGLN1 , RELA , HGF [19, 5] and others) or cell death (BCL2, FADD). Cancer typically affects the balance between these two fundamental processes and their deregulation eventually leads to tumor development. A second subgraph concerns variable N, the degree of spread to regional lymph nodes, which is expressed in levels to . Here, all genes in the neighbourhood correspond to the lymphatic system and its direct responses to malignant cell growth, which was confirmed for FGF9 , MDM2 [15, 8] and TRAF4  among others.
Despite the study’s focus on colorectal cancer and specifics of the intestinal system, the inferred Markov blanket is able to explain rather general properties in accordance with findings in the medical literature. Altogether, this nicely illustrates how the Gaussian copula framework complements the Bayesian Markov blanket estimation – especially pertaining to the clinical domain with mixed observations and missing values.
In contrast to our approach, the high dimensionality of this dataset imposes severe problems for BGL. For BMB, 5000 Gibbs sweeps could be computed in 2 hours, and MCMC diagnosis did not show any severe convergence problems. For BGL, however, the same number of iterations already took hours ( days), and we observed similar (and sometimes severe) mixing problems as described in the previous section.
We have presented a Bayesian perspective for estimating the Markov blanket of a set of query nodes in an undirected network. In our experience, it is often the case that we estimate a full network but interpret only part of it. This is especially true in a context where portions of our data are qualitatively different. Here, we would be more interested in establishing the links between these portions, rather than examining the links within the portions themselves. Markov blanket estimation is hence an interesting and relevant sub-problem of network estimation, particularly in high dimensional settings. Existing methods such as the Bayesian graphical lasso iterate through the individual variables to estimate an entire network. While there are several situations in which inference of the entire network is required, there are also cases in which we are only interested in the neighbourhood of a small subset of query variables; for these instances, iterating through all the variables is unnecessary.
In this paper, we explored the blockwise factorisation of the Wishart likelihood in combination with a suitable choice of prior. Our primary contribution in Theorem 1 shows that the resulting posterior distribution of the Markov blanket of a set of query nodes has an analytic form, and is independent of a large portion of the network. The analytic form allows us to explore potentially large neighbourhoods where the Bayesian graphical lasso reaches its limits. We also demonstrated that sampling from the posterior of the Markov blanket is more efficient than the Bayesian graphical lasso. Moreover, we observed fast convergence and superior mixing properties of the Markov chain. We attribute this to the improved flexibility of our sampling strategy.
Including a copula construct in the model further enhances its real world applicability, where mixed data and missing values are prevalent. A particular application in a medical setting is the colorectal example we considered in Section 5.2. Using this approach allowed us to make interesting observations about the interactions between various clinical and genetic factors. Such insights could ultimately contribute to a better understanding of the disease.
Appendix A Model
Let be Wishart distributed with degrees of freedom and inverse covariance , and define the following partitioning:
Here, denotes the th row of , , and .
Appendix B Wishart Distribution
If , , then
We will prove Lemma 1 from the paper.
If , then
The proof is similar to . Factorising the Wishart density according to
the independence follows from
Appendix C Matrix Normal () Distribution
follows a Matrix Normal distribution, if
where mean , column covariance , and row covariance .
If is Wishart distributed, then
is matrix normal distributed.
The proof is similar to . Factorising the Wishart density according to
where we changed variables with Jacobian and integrated over . Factorising the determinants according to
gives the desired result.
Appendix D Sampling from
We will prove Theorem 1, part 1 from the paper.
Theorem 1, Part 1
Vectorised rows of follow a joint Normal distribution
where the matrix is , and is a diagonal matrix containing .
which is equivalent to
where are the vectorised rows of matrix and denotes the Kronecker product. For inclusion of the double exponential prior, it has to be rewritten as
where denotes the th row of , , and . The result follows from multiplying the prior by the expanded density in Eq. (31).
Appendix E Matrix Generalised Inverse Gaussian () Distribution
follows a Matrix Generalised Inverse Gaussian (MGIG) distribution, if and
Let be MGIG distributed, then is also MGIG distributed:
Theorem 1, Part 2
If is Wishart distributed, then
is MGIG distributed.
The proof is similar to . Factorising the Wishart density according to