Graphical models of complex, multivariate datasets manifest intuitive illustrations of the correlation structures of the data, and are of interest in different disciplines (Whittaker, 2008; Benner et al., 2014; Airoldi, 2007; Carvalho and West, 2007; Bandyopadhyay and Canale, 2016). Much work has been undertaken to study the correlation structure of a multivariate dataset comprising multiple measured values of a vector-valued observable, by modelling the joint probability distribution of a set of such observable values as matrix-normal (Wang and West, 2009; Ni et al., 2017; Gruber and West, 2016). In this paper, we simultaneously learn the partial correlation structure and graphical model of a multivariate dataset, while making inference on the uncertainties of each, and acknowledge measurement errors in our learning, with the ultimate aim of computing the distance between (posterior probability distributions of) the pair of graphical models learnt of the respective datasets. This distance informs us about the independence of the datasets, which are generated under different environmental conditions. To this effect, we undertake Metropolis-within-Gibbs-based Bayesian inference (Robert and Casella, 2004): on the correlation matrix given the data, and on the graph given the updated correlation.
Objective and comprehensive uncertainties on the Bayesianly learnt graphical model of given multivariate data are sparsely available in the literature. Such uncertainties can potentially be very useful in informing us about the range of models that describe the partial correlation structure of the data at hand. Madigan and Raftery (1994) discuss a method for computing model uncertainties by averaging over a set of identified models, and they advance ways to compute the posterior model probabilities, by taking advantage of the graphical structure, for two classes of considered models, namely, the recursive causal models (Kiiveri et al., 1984) and the decomposable loglinear models (Goodman, 1970). This method allows them to select the "best models", while accounting for model uncertainty. Our method, on the other hand, provides a direct and well-defined way of learning uncertainties of the graphical model of a given multivariate dataset. The graph is updated in every iteration of our learning of the graphical structure of the data; those sampled graphs that lie within an identified range of values of the posterior probability of the graph comprise the uncertainty-included graphical model of the data. In addition, our method permits incorporation of measurement errors into the learning of the graphical model, and permits fast learning of large networks.
However, we wish to explore extending such learning to higher-dimensional data, for example, to a dataset that is cuboidally shaped, given that it comprises multiple measurements of a matrix-valued observable. Hoff (2011), Xu et al. (2012), and Wang and Chakrabarty (under preparation) advance methods to learn the correlation in high-dimensional data in general. For a rectangularly-shaped multivariate dataset, the pioneering work by Wang and West (2009) allows for the learning of both the between-rows and between-columns covariance matrices, and therefore, of two graphical models. Ni et al. (2017) extend this approach to high-dimensional data. However, a high-dimensional graph showing the correlation structure amongst the multiple components of a general hypercuboidally-shaped dataset is not easy to visualise or interpret. Instead, in this paper, we treat the high-dimensional data as built of correlated rectangularly-shaped slices, given each of which the between-columns (partial) correlation structure and graphical model are Bayesianly learnt, along with uncertainties, subsequent to our closed-form marginalisation over all between-rows correlation matrices (unlike in the work of Wang and West (2009)). We then compute the Hellinger distance (Matusita, 1953; Banerjee et al., 2015) between the posterior probability densities of the pair of graphical models that are learnt given the respective pair of such rectangularly-shaped data slices, and use the Hellinger affinity measure to infer on the correlation between the datasets. For example, by computing the pairwise Hellinger distance between posterior probability densities of each learnt pair of graphs, we can avoid the inadequacy of trying to capture spatial correlations amongst sets of multivariate observations by "computing partial correlation coefficients and by specifying and fitting more complex graphical models", as was noted by Guinness et al. (2014). In fact, our method offers the inter-graph distance for two differently sized datasets. Importantly, we will demonstrate below that our learning of uncertainties in graphical models allows for the pursuit of the inter-graph distance, thus integrating the dual strands of the work.
Our learnt graphical model of the given data comprises a set of random inhomogeneous graphs (Frieze and Karonski, 2016) that lie within the credible regions that we define, where each such graph is a generalisation of a Binomial graph (Frieze and Karonski, 2016). We do not make inference on the graph (writing its posterior) clique-by-clique, and neither are we reliant on the closed-form nature of the posteriors that we sample from. In other words, we do not need to invoke conjugacy to effect our learning, either of the partial correlation structure of the data or of the graphical model. In Bayesian learning of Gaussian undirected graphs, a Hyper-Inverse-Wishart prior is often imposed on the covariance matrix of the data, as this allows for a Hyper-Inverse-Wishart posterior of the covariance, which in turn implies that the marginal posterior of any clique is Inverse-Wishart, a known, closed-form density (Dawid and Lauritzen, 1993; Lauritzen, 1996). Inference is then rendered easier than when posterior sampling from a non-closed-form posterior needs to be undertaken, using numerical techniques such as MCMC. Now, if the graph is not decomposable and a Hyper-Inverse-Wishart prior is placed on the covariance matrix, the resulting Hyper-Inverse-Wishart joint posterior density, though it can be factorised into a set of Inverse-Wishart densities, does not yield these densities as the clique marginals. Expressed differently, the clique marginals are not of closed form when the graph is not decomposable. However, this is not a worry in our learning, i.e. we can undertake our learning irrespective of the validity of decomposability.
2 Learning the correlation matrix and graphical model given data, using Metropolis-within-Gibbs
Let $Y=(Y_1,\ldots,Y_p)^T$ be a $p$-dimensional observed vector. Let there be $n$ measurements of $Y$, so that the $n\times p$-dimensional matrix $D$ is the data that comprises $n$ measurements of the $p$-dimensional observable $Y$. Let the $i$-th realisation of $Y$ be $y_i$, $i=1,\ldots,n$. We standardise the variable $Y_j$ ($j=1,\ldots,p$) by its empirical mean $\bar{Y}_j$ and standard deviation $s_j$, into $Z_j := (Y_j-\bar{Y}_j)/s_j$, s.t. the standardised version $D_S$ of data $D$ comprises $n$ measurements of the $p$-dimensional vector $Z=(Z_1,\ldots,Z_p)^T$. Then we model the joint probability of a set of $n$ measurements of $Z$ (such as the set that comprises the standardised data $D_S$) to be matrix-normal with zero mean, i.e. the likelihood of the covariance matrices $\Sigma_R$ and $\Sigma_C$, given data $D_S$, is matrix-normal:
$$\ell(\Sigma_R,\Sigma_C\mid D_S) \;=\; \frac{\exp\left(-\frac{1}{2}\,\mathrm{tr}\left[\Sigma_C^{-1} D_S^T \Sigma_R^{-1} D_S\right]\right)}{(2\pi)^{np/2}\,|\Sigma_C|^{n/2}\,|\Sigma_R|^{p/2}}.$$
Here $\Sigma_R$ generates the covariance between the standardised observation vectors, i.e. the correlation between the rows of the standardised data set $D_S$. Similarly, $\Sigma_C$ generates the covariance between the standardised variables $Z_j$ and $Z_k$, i.e. the correlation between the columns of $D_S$.
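As an aside, the zero-mean matrix-normal log-likelihood above is straightforward to evaluate numerically. The following is a minimal sketch (the function name, and the use of NumPy, are our illustrative choices, not part of the method):

```python
import numpy as np

def matrix_normal_logpdf(X, U, V):
    """Log-density of the zero-mean matrix-normal MN(0, U, V).

    X : n x p data matrix (e.g. the standardised data D_S)
    U : n x n between-rows covariance (Sigma_R)
    V : p x p between-columns covariance (Sigma_C)
    """
    n, p = X.shape
    Uinv = np.linalg.inv(U)
    Vinv = np.linalg.inv(V)
    _, logdet_U = np.linalg.slogdet(U)   # U, V positive definite, so sign = +1
    _, logdet_V = np.linalg.slogdet(V)
    # trace term of the matrix-normal exponent
    quad = np.trace(Vinv @ X.T @ Uinv @ X)
    return -0.5 * (quad + n * logdet_V + p * logdet_U + n * p * np.log(2.0 * np.pi))
```

With $U$ and $V$ set to identity matrices, this reduces to the sum of independent standard-normal log-densities over all $np$ entries, which provides a quick sanity check.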
The joint posterior probability density of the correlation matrices $\Sigma_R$ and $\Sigma_C$, given the standardised data $D_S$, is proportional to the product of the likelihood $\ell(\Sigma_R,\Sigma_C\mid D_S)$ and the priors on $\Sigma_R$ and $\Sigma_C$. This can be marginalised over the $n\times n$-dimensional between-rows correlation matrix $\Sigma_R$, to yield the marginal posterior of $\Sigma_C$ given $D_S$, where the prior on $\Sigma_C$ is uniform, the prior on $\Sigma_R$ is non-informative, and $\Sigma_R$ is assumed invertible. Here, the normalisation of the likelihood is a function of $\Sigma_C$.
The joint posterior probability density of $\Sigma_R$ and $\Sigma_C$, given data $D_S$, then follows: using the likelihood from Equation 2.1; using the non-informative prior on $\Sigma_R$ stated above; and using a uniform prior on $\Sigma_C$.
Marginalising $\Sigma_R$ out of the joint posterior, we get the marginal posterior of $\Sigma_C$ given $D_S$. To perform this marginalisation in closed form, we change variables to the inverse of $\Sigma_R$, using the known Jacobian of this transformation (Mathai and Pederzoli, 1997), and rewrite the trace term in the likelihood using the commutativeness of the trace, so that in Equation 2.3 the integral over all between-rows correlation matrices takes the form of an integral of an unnormalised Wishart density.
The integral in the RHS of Equation 2.3 represents the unnormalised Wishart density over all values of the random matrix that is the inverse of $\Sigma_R$, where the scale matrix and degrees of freedom of this Wishart density are determined by $D_S$, $\Sigma_C$ and $n$. Thus, the integral in the RHS of Equation 2.3 is the integral of this unnormalised Wishart density over the full support of the random matrix, i.e. it is the normalisation of this Wishart density, and is therefore available in closed form: it is proportional to a power of the determinant of the scale matrix.
Now, if a square matrix is invertible, the determinant of its inverse is the reciprocal of its determinant. It is given that $\Sigma_C$ is invertible, i.e. $\Sigma_C^{-1}$ exists. The original dataset is examined to discard rows that are linear transformations of other rows, leading to a data matrix no two rows of which are linear transformations of each other, so that the relevant matrix of the data is positive definite. Using this in Equation 2.5:
This posterior of the between-columns correlation matrix $\Sigma_C$, given data $D_S$, is normalised over all possible datasets, where the possible datasets abide by a between-columns correlation matrix of $\Sigma_C$, as in Equation 2.7. Each such dataset has $n$ rows and $p$ columns, comprising values of the random standardised variables $Z_1,\ldots,Z_p$, simulated to bear a between-columns correlation matrix of $\Sigma_C$, s.t. the relevant matrix of the simulated data is positive definite. Choosing the same number of rows for all choices of the random data matrix, i.e. a constant $n$, the normalisation is then a positive definite function of $\Sigma_C$.
The posterior as given by Theorem 2.1 suggests that we have to compute the normalisation in every iteration, i.e. for every updated value of $\Sigma_C$. We recall that this normalisation is given by Equation 2.7, where the standardised data sets are simulated s.t. their between-columns correlation matrix is the current $\Sigma_C$. However, it is hard to compute this normalisation, defined in Equation 2.7, as a closed-form integral. Instead, we define an estimator of the integral representing the all-data-averaged normalisation, in the $t$-th iteration of the $N$-iteration long MCMC chain that we run ($t=1,\ldots,N$). So, we estimate the integral in the RHS of Equation 2.7 using its unbiased estimator, rephrasing the integrand as an expectation. However, it is difficult to implement this estimator by sequentially computing expectations w.r.t. the distribution of each of the elements of a simulated data matrix. Instead, we compute the expectation w.r.t. the block of these elements, where the simulated data matrix abides by a between-columns correlation of $\Sigma_C$. During the $t$-th iteration of the MCMC chain, let the between-columns correlation matrix be the current update of $\Sigma_C$, at which the normalisation is sought. The estimator of this normalisation is then defined using the current $\Sigma_C$ and a sample of $n\times p$-dimensional data sets simulated to bear this between-columns correlation, each of which yields a positive definite data matrix (Equation 2.9).
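The block-wise Monte Carlo estimator described above can be sketched as follows. Here `integrand` is a placeholder for the integrand of the paper's normalisation integral (its exact form is defined in Equation 2.7, not in this sketch), and the sampling of datasets with a prescribed between-columns correlation uses the Cholesky factor of that correlation matrix:

```python
import numpy as np

def estimate_normalisation(Sigma_C, integrand, n_rows, num_samples, rng):
    """Monte Carlo estimate of the normalisation: average the integrand
    over datasets simulated to bear between-columns correlation Sigma_C.

    integrand : callable (dataset, Sigma_C) -> float  (placeholder for the
                posterior kernel appearing in the normalisation integral)
    """
    p = Sigma_C.shape[0]
    L = np.linalg.cholesky(Sigma_C)          # lower-triangular square root
    total = 0.0
    for _ in range(num_samples):
        # rows iid N(0, Sigma_C): column correlation is Sigma_C in expectation
        D = rng.standard_normal((n_rows, p)) @ L.T
        total += integrand(D, Sigma_C)
    return total / num_samples
```

Averaging over the whole simulated data block at once, rather than element by element, is what makes the estimator practical, as noted in the text.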
2.1 Learning the graphical model
We perform Bayesian learning of the inhomogeneous, Generalised Binomial random graph, given the learnt $p\times p$-dimensional, between-columns correlation matrix $\Sigma_C$ of the standardised data set $D_S$. Thus, the graph is a random variable that gets updated in every iteration of our Bayesian inference scheme (Metropolis-within-Gibbs), at the (partial) correlation matrix updated given the data. Here, the graph has the vertex set $V=\{v_1,\ldots,v_p\}$, and the between-columns partial correlation matrix of the data is $R=[R_{ij}]$, s.t. $R_{ij}$ takes a value in $[-1,1]$, $i,j=1,\ldots,p$. The vertices $v_i$ and $v_j$, $i\neq j$, are joined by the edge $G_{ij}$, which is a random binary variable taking values in $\{0,1\}$, where $G_{ij}$ is the $ij$-th element of the edge matrix. Given a learnt value of the between-columns correlation matrix $\Sigma_C$, to compute the value of the partial correlation variable $R_{ij}$, we first invert $\Sigma_C$, to yield $\Psi:=\Sigma_C^{-1}$, s.t.
$$R_{ij} \;=\; -\frac{\Psi_{ij}}{\sqrt{\Psi_{ii}\,\Psi_{jj}}}\quad\text{for } i\neq j,$$
and $R_{ii}=1$ for $i=1,\ldots,p$.
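The computation of the partial correlation matrix from the inverted correlation matrix is a few lines of code; the identity used is the standard one relating partial correlations to the precision (inverse-correlation) matrix:

```python
import numpy as np

def partial_correlations(Sigma_C):
    """Partial correlation matrix from a correlation matrix Sigma_C,
    via rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj),
    where Omega is the inverse of Sigma_C; unit diagonal by convention."""
    Omega = np.linalg.inv(Sigma_C)
    d = np.sqrt(np.diag(Omega))
    R = -Omega / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R
```

For two variables only, the partial correlation coincides with the marginal correlation, which gives a convenient check of the sign convention.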
The posterior probability density of the graph, defined for the edge matrix, is proportional to the product of the prior probability density on the edge parameters and the likelihood of the edge parameters, given the partial correlation matrix $R$ (that is itself computed using the between-columns correlation matrix $\Sigma_C$, learnt given $D_S$; see Equation 2.8). We choose a prior on each edge parameter $G_{ij}$ that is Bernoulli with rate parameter 0.5; thus, the prior is independent of the edge parameters. We choose to define the likelihood of any edge parameter, given the corresponding partial correlation, to be a Folded Normal density (Leone et al., 1961). The variance parameters of these Folded Normal densities are indeed hyperparameters that are also learnt from the data; these variance parameters have uniform prior probabilities imposed on them. In light of the fact that the parameters that are learnt are the edge parameters as well as the variance parameters, we rephrase the likelihood as a joint function of the edge and variance parameters.
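The Folded Normal density invoked above has a simple closed form: the density of $|X|$ for $X\sim N(\mu,\sigma^2)$, supported on the non-negative reals. A minimal evaluation routine (the function name is ours):

```python
import numpy as np

def folded_normal_pdf(x, mu, sigma):
    """Density of the folded normal |N(mu, sigma^2)| at x >= 0:
    the sum of the N(mu, sigma^2) and N(-mu, sigma^2) densities at x."""
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return c * (np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
                + np.exp(-(x + mu) ** 2 / (2.0 * sigma ** 2)))
```

At $\mu=0$ this reduces to the half-normal density, and the density is invariant under the sign of $\mu$, both of which serve as quick checks.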
2.2 Inference using Metropolis-within-Gibbs
Equation 2.8 gives the posterior probability density of the correlation matrix $\Sigma_C$, given data $D_S$. In our Metropolis-within-Gibbs-based inference, we update $\Sigma_C$, at which the partial correlation matrix is computed. Given this updated partial correlation matrix, we then update the graph. The graphical model comprising the credible-region-defining set of random Binomial graphs is thus learnt, where the vertex set of each graph in this set is fixed; the "credible region" in question is defined below in Section 2.3.
To be precise, by updating the symmetric correlation matrix $\Sigma_C$, we imply updating the non-diagonal elements of its upper (or lower) triangle. Given the nature of the likelihood of the parameters (see Equation 2.8), this updating involves the inversion of $\Sigma_C$, and the computing of the determinants of $\Sigma_C$ and its inverse, in every iteration. This is possible via the computation of the square root of the column correlation matrix $\Sigma_C$, as well as the factorisation of its inverse into two triangular matrices. Thus, in any iteration, we compute the $p\times p$-dimensional, (lower, by choice) triangular matrix $L$ that is the square root of $\Sigma_C$, i.e. $\Sigma_C = LL^T$. Here $L$ is computed via Cholesky decomposition. Since the triangular factor $L$ of $\Sigma_C$ is already computed, the inversion of $\Sigma_C$ is undertaken by inverting $L$ using a forward substitution algorithm; the inverse $\Sigma_C^{-1}$ is then itself Cholesky decomposed into triangular factors, allowing its determinant to be computed. The underlying schemes for forward substitution and Cholesky decomposition are discussed in Section 7 of the Supplementary Material.
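The Cholesky-plus-forward-substitution route described above can be sketched as follows (a sketch only; the function name is ours, and the log-determinant of the inverse is simply the negative of the returned log-determinant):

```python
import numpy as np

def invert_via_cholesky(Sigma):
    """Invert a positive-definite matrix via its Cholesky factor.

    Sigma = L L^T with L lower-triangular; solve L X = I column by column
    by forward substitution, so X = L^{-1}; then Sigma^{-1} = X^T X.
    The log-determinant falls out of the diagonal of L for free.
    """
    L = np.linalg.cholesky(Sigma)            # lower-triangular square root
    n = L.shape[0]
    X = np.zeros((n, n))
    I = np.eye(n)
    for j in range(n):                       # forward substitution per column
        for i in range(n):
            X[i, j] = (I[i, j] - L[i, :i] @ X[:i, j]) / L[i, i]
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return X.T @ X, logdet
```

In practice one would call an optimised triangular solver, but the explicit loop mirrors the forward substitution scheme referred to in the text.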
As mentioned above, we assume that at every iteration, the proposed $\Sigma_C$ and its square root are positive definite; to implement this, we define the data to be composed of only those rows of the original data set that are linearly independent of each other. It is still possible that during an iteration, a proposed $\Sigma_C$ is s.t. its square root is not positive definite, within some pre-set numerical threshold. One possible solution to this problem is numerical, for example, implementing a ridge adjustment (Wothke, 1993), i.e. adding a "small" number to the diagonal elements of the matrix, where "small" is typically a pre-set small fraction of a diagonal element of the matrix in our implementation.
As we saw in Section 2, the joint posterior probability density of the elements of the upper/lower triangle of the correlation matrix, given data $D_S$, is normalised, with a normalisation factor that needs updating at each update of $\Sigma_C$. This normalisation factor is defined in Equation 2.7. An estimator of this normalisation in the $t$-th iteration is given in Equation 2.9, as the arithmetic mean of a number of random realisations of the integrand in the integral representing the normalisation. Each such realisation results from a sampled data set that bears the column correlation matrix updated in this $t$-th iteration of the MCMC chain. Each of the sampled data sets is generated with $n$ rows and $p$ columns. To generate such a randomly sampled data set with the column correlation of the $t$-th iteration, we use established sampling techniques for sampling data given a correlation structure.
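One such established technique is to post-multiply a matrix of iid standard normals by the transpose of the Cholesky factor of the target correlation matrix; rows of the result are then iid draws with the prescribed between-columns correlation. A minimal sketch (function name ours):

```python
import numpy as np

def sample_with_column_correlation(n, Sigma_C, rng):
    """Draw an n x p data matrix whose rows are iid N(0, Sigma_C),
    so that the between-columns correlation matrix is Sigma_C in expectation."""
    p = Sigma_C.shape[0]
    L = np.linalg.cholesky(Sigma_C)    # lower-triangular square root
    Z = rng.standard_normal((n, p))    # iid standard normal entries
    return Z @ L.T                     # rows now bear covariance L L^T = Sigma_C
```

The empirical column correlation of a large sample converges to $\Sigma_C$, which is easily verified numerically.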
The algorithm followed for our Metropolis-within-Gibbs-based inference is the following.
At the 0-th iteration, we choose a seed value of the correlation matrix $\Sigma_C$, i.e. choose seed values of its non-diagonal, upper-triangle elements. At this iteration, as at every iteration, Cholesky decomposition of the column correlation matrix is undertaken, i.e. $\Sigma_C = LL^T$. Then the determinant of $\Sigma_C$ is computed as the square of the product of the diagonal elements of $L$. Using $L$, its inverse is computed (using forward substitution), and thence $\Sigma_C^{-1}$. Next, $\Sigma_C^{-1}$ is Cholesky decomposed, to allow for its determinant to be computed. Using $\Sigma_C^{-1}$, the partial correlation matrix in this iteration is computed, i.e. its $ij$-th element is computed for each pair $i,j$. For this seed correlation matrix, the estimator of the normalisation factor is computed using Equation 2.9, following the random selection of a pre-set number of $n\times p$-dimensional data sets bearing this column correlation. We compute the joint posterior of the non-diagonal, upper-triangle elements of $\Sigma_C$, given data $D_S$, using Equation 2.8.
In the 0-th iteration, seed values of the edge parameters are also assigned, as are seed values of the variance parameters. The joint posterior probability density of the edge and variance parameters is given as the product of the Folded Normal likelihood defined above, the uniform prior on each variance parameter, and the Bernoulli prior with rate parameter 0.5 on each edge parameter.
As the $t$-th iteration begins ($t\geq 1$), we propose a value for each non-diagonal, upper-triangle element of $\Sigma_C$, from a Truncated Normal density that is left-truncated at $-1$ and right-truncated at $1$, where the variance of this proposal density is experimentally chosen, and the proposal mean is the current value of the element at the end of the $(t-1)$-th iteration. Let the correlation matrix so proposed in the $t$-th iteration be the "proposed" value, and the correlation matrix at the end of the $(t-1)$-th iteration the "current" value. The estimator of the normalisation at the proposed correlation matrix is computed using the generated sample in Equation 2.9, as is the normalisation at the current correlation matrix. Then we accept the proposed correlation matrix at an acceptance probability given by the ratio of the (normalised) posteriors of Equation 2.6 at the proposed and current values,
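The doubly-truncated Normal proposal on $(-1,1)$ can be drawn by simple rejection, which is adequate here because the truncation interval is wide relative to a typical proposal scale (function name ours; a dedicated truncated-normal sampler could equally be used):

```python
import numpy as np

def propose_truncated_normal(mean, sd, rng, lo=-1.0, hi=1.0):
    """Draw from N(mean, sd^2) truncated to (lo, hi), by rejection:
    redraw until the sample falls inside the interval."""
    while True:
        x = rng.normal(mean, sd)
        if lo < x < hi:
            return x
```

Every draw lies strictly inside $(-1,1)$, as required for a proposed correlation element.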
where the posterior is given in Equation 2.6. If this acceptance probability exceeds a uniformly sampled random number $u\sim\mathcal{U}[0,1]$, we update the correlation matrix to the proposed value; else the current value is retained. We invert the current correlation matrix, to compute the current value of the partial correlation matrix (see Equation 2.10).
In the 2nd block of the $t$-th iteration, the graph variable is updated, given the current partial correlation matrix. To be precise, we propose a value of each edge parameter, and, for each variance hyperparameter, we propose a value from a Normal proposal density whose variance is experimentally chosen and whose mean is the current value of that parameter. Then we accept the proposed edge and variance parameters jointly, at an acceptance probability given by the ratio of their joint posterior (defined above) at the proposed and current values. If this acceptance probability exceeds a uniformly sampled random number $u\sim\mathcal{U}[0,1]$, we accept the proposed values of the edge and variance parameters. Otherwise, the proposed values of the parameters are rejected, and we set the current values in iteration $t$ to be those of the previous iteration. The graph at the end of the $t$-th iteration is defined at the partial correlation matrix updated in this iteration.
Stop if the maximal iteration number $N$ is attained; else repeat from Step 2.
2.3 Definition of 95% HPD credible regions on the random graph variable, and the learnt graphical model
We perform Bayesian inference on the random graph variable, leading to one sampled graph at the end of each of the $N$ iterations of our inference scheme (Metropolis-within-Gibbs). In order to acknowledge uncertainties in the Bayesian learning of a single representation of this graphical model, we need to include in its definition only those graphs, sampled post-burnin, that lie within an identified 95% HPD credible region. We define the fraction of the post-burnin iterations in which the $ij$-th edge exists, i.e. in which $G_{ij}$ takes the value 1. This fraction is the fractional number of sampled graphs in which an edge exists between vertices $v_i$ and $v_j$. This leads us to interpret it as carrying information about the uncertainty in the graph learnt given data $D_S$; in particular, it approximates the probability of existence of the edge between the $i$-th and $j$-th nodes in the graphical model of the data at hand. Indeed, these edge probability parameters are functions of the partial correlation matrix that is learnt given this data, but for the sake of notational brevity, we do not include this explicit dependence in our notation for the edge probability parameters.
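Computing these post-burnin edge fractions from the chain of sampled graphs is a one-line reduction; the following sketch assumes the sampled edge matrices have been stacked into a single 0/1 array (names ours):

```python
import numpy as np

def edge_probabilities(edge_samples, burnin):
    """Fraction of post-burnin sampled graphs in which each edge exists.

    edge_samples : array of shape (num_iter, p, p), entries in {0, 1}
    burnin       : number of initial iterations to discard
    Returns a p x p array of empirical edge probabilities."""
    post = np.asarray(edge_samples)[burnin:]
    return post.mean(axis=0)           # elementwise fraction of 1s
```

Thresholding or interval-summarising these fractions then yields the 95% HPD credible region on the graph, as defined in this section.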
So we view the set