1 Introduction
Graphical models of complex, multivariate datasets offer intuitive illustrations of the correlation structure of the data, and are of interest in different disciplines (Whittaker, 2008; Benner et al., 2014; Airoldi, 2007; Carvalho and West, 2007; Bandyopadhyay and Canale, 2016). Much work has been undertaken to study the correlation structure of a multivariate dataset comprising multiple measured values of a vector-valued observable, by modelling the joint probability distribution of a set of such observable values as matrix-normal (Wang and West, 2009; Ni et al., 2017; Gruber and West, 2016). In this paper, we simultaneously learn the partial correlation structure and graphical model of a multivariate dataset, while making inference on the uncertainties of each, and acknowledge measurement errors in our learning, with the ultimate aim of computing the distance between (posterior probability distributions of) the learnt pair of graphical models of the respective datasets. Such a distance informs us about the independence of the datasets, generated under different environmental conditions. To this effect, we undertake Metropolis-within-Gibbs-based Bayesian inference
(Robert and Casella, 2004), on the correlation matrix given the data, and on the graph given the updated correlation.

Objective and comprehensive uncertainties on the Bayesianly learnt graphical model of given multivariate data are sparsely available in the literature. Such uncertainties can potentially be very useful in informing us about the range of models that describe the partial correlation structure of the data at hand. Madigan and Raftery (1994) discuss a method for computing model uncertainties by averaging over a set of identified models, and they advance ways for the computation of the posterior model probabilities, by taking advantage of the graphical structure, for two classes of considered models, namely, the recursive causal models (Kiiveri et al., 1984) and the decomposable log-linear models (Goodman, 1970). This method allows them to select the “best models”, while accounting for model uncertainty. Our method, on the other hand, provides a direct and well-defined way of learning uncertainties of the graphical model of given multivariate data. At every update of our learning of the graphical structure of the data, the graph is updated; graphs thus learnt, if identified to lie within an identified range of values of the posterior probability of the graph, comprise the uncertainty-included graphical model of the data. In addition, our method permits the incorporation of measurement errors into the learning of the graphical model, and permits fast learning of large networks.
However, we wish to explore extending such learning to higher-dimensional data, for example, to a dataset that is cuboidally-shaped, given that it comprises multiple measurements of a matrix-valued observable. Hoff (2011), Xu et al. (2012), and Wang and Chakrabarty (under preparation) advance methods to learn the correlation in high-dimensional data in general. For a rectangularly-shaped multivariate dataset, the pioneering work by
Wang and West (2009) allows for the learning of both the between-rows and between-columns covariance matrices, and therefore, of two graphical models. Ni et al. (2017) extend this approach to high-dimensional data. However, a high-dimensional graph showing the correlation structure amongst the multiple components of a general hypercuboidally-shaped dataset is not easy to visualise or interpret. Instead, in this paper, we treat the high-dimensional data as built of correlated rectangularly-shaped slices, given each of which the between-columns (partial) correlation structure and graphical model are Bayesianly learnt, along with uncertainties, subsequent to our closed-form marginalisation over all between-rows correlation matrices (unlike in the work of Wang and West (2009)). We then compute the Hellinger distance (Matusita, 1953; Banerjee et al., 2015) between the posterior probability densities of the pair of graphical models that are learnt given the respective pair of such rectangularly-shaped data slices, and use the Hellinger affinity measure to infer the correlation between the datasets. For example, by computing the pairwise Hellinger distance between posterior probability densities of each learnt pair of graphs, we can avoid the inadequacy of trying to capture spatial correlations amongst sets of multivariate observations by “computing partial correlation coefficients and by specifying and fitting more complex graphical models”, as was noted by Guinness et al. (2014). In fact, our method offers the inter-graph distance for two differently-sized datasets.
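The Hellinger distance invoked above admits a compact numerical illustration. The following Python sketch (our illustration, not the authors' code; it assumes two posterior densities evaluated on a common grid) computes the Hellinger affinity $BC=\int\sqrt{p\,q}$ and the distance $H=\sqrt{1-BC}$:

```python
import numpy as np

def hellinger_distance(p, q, dx):
    """Hellinger distance H = sqrt(1 - BC) between two densities sampled on a
    common grid with spacing dx; BC = integral of sqrt(p * q) is the
    Hellinger affinity (Bhattacharyya coefficient)."""
    bc = np.sum(np.sqrt(p * q)) * dx  # affinity, in [0, 1]
    return float(np.sqrt(max(0.0, 1.0 - bc)))

# Hypothetical posteriors of a scalar graph summary, on a grid:
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)          # N(0, 1)
q = np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2.0 * np.pi)  # N(1, 1)
H = hellinger_distance(p, q, dx)
```

For two Normal densities with equal variance $\sigma^2$ and mean difference $\Delta$, the affinity is $\exp(-\Delta^2/8\sigma^2)$, which the grid computation reproduces. A distance near 0 suggests strongly overlapping posteriors; a distance near 1, nearly disjoint ones.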
Importantly, we will demonstrate below that our learning of uncertainties in graphical models allows for the pursuit of the inter-graph distance, thus integrating the dual strands of the work.

Our learnt graphical model of the given data comprises a set of random inhomogeneous graphs (Frieze and Karonski, 2016) that lie within the credible regions that we define, where each such graph is a generalisation of a Binomial graph (Frieze and Karonski, 2016). We do not make inference on the graph (writing its posterior) clique-by-clique, and neither are we reliant on the closed-form nature of the posteriors to sample from. In other words, we do not need to invoke conjugacy to effect our learning, either of the partial correlation structure of the data or of the graphical model. In Bayesian learning of Gaussian undirected graphs, a Hyper-Inverse Wishart prior is typically imposed on the covariance matrix of the data, as this then allows for a Hyper-Inverse Wishart posterior of the covariance, which in turn implies that the marginal posterior of any clique is Inverse Wishart, a known, closed-form density (Dawid and Lauritzen, 1993; Lauritzen, 1996). Inference is then rendered easier than when posterior sampling from a non-closed-form posterior needs to be undertaken, using numerical techniques such as MCMC. Now, if the graph is not decomposable, and a Hyper-Inverse Wishart prior is placed on the covariance matrix, the resulting Hyper-Inverse Wishart joint posterior density, though factorisable into a set of Inverse Wishart densities, cannot have these factors identified as the clique marginals. Expressed differently, the clique marginals are not closed-form when the graph is not decomposable. However, this is not a worry in our learning, i.e. we can undertake our learning irrespective of the validity of decomposability.
2 Learning correlation matrix and graphical model given data, using Metropolis-within-Gibbs
Let $\boldsymbol{Y}=(Y_1,\ldots,Y_p)^T$ be a $p$-dimensional observed vector, with $Y_j\in\mathbb{R}$, $j=1,\ldots,p$. Let there be $n$ measurements of $\boldsymbol{Y}$, $\boldsymbol{y}_1,\ldots,\boldsymbol{y}_n$, so that the $n\times p$-dimensional matrix $\boldsymbol{D}$ is the data that comprises $n$ measurements of the $p$-dimensional observable $\boldsymbol{Y}$. Let the $i$-th realisation of $Y_j$ be $y_{ij}$, $i=1,\ldots,n$. We standardise the variable $Y_j$ by its empirical mean $\bar{y}_j$ and standard deviation $s_j$, into $Z_j:=(Y_j-\bar{y}_j)/s_j$, s.t. the standardised version $\boldsymbol{D}_S$ of data $\boldsymbol{D}$ comprises $n$ measurements of the $p$-dimensional vector $\boldsymbol{Z}=(Z_1,\ldots,Z_p)^T$. Thus, $z_{ij}=(y_{ij}-\bar{y}_j)/s_j$, where $\bar{y}_j=\frac{1}{n}\sum_{i=1}^n y_{ij}$ and $s_j^2=\frac{1}{n-1}\sum_{i=1}^n (y_{ij}-\bar{y}_j)^2$. The $n\times p$-dimensional matrix $\boldsymbol{D}_S=(z_{ij})$. Then we model the joint probability of a set of $n$ measurements of $\boldsymbol{Z}$ (such as the set that comprises the standardised data $\boldsymbol{D}_S$) to be matrix-normal with zero mean, i.e. the likelihood of the covariance matrices $\boldsymbol{\Sigma}_R$ and $\boldsymbol{\Sigma}_C$, given data $\boldsymbol{D}_S$, is matrix-normal:
$$\ell(\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)=\frac{\exp\left(-\frac{1}{2}\,\mathrm{tr}\left[\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\boldsymbol{\Sigma}_R^{-1}\boldsymbol{D}_S\right]\right)}{(2\pi)^{np/2}\,|\boldsymbol{\Sigma}_C|^{n/2}\,|\boldsymbol{\Sigma}_R|^{p/2}}.\qquad(2.1)$$
Here $\boldsymbol{\Sigma}_C$ generates the covariance between the standardised variables $Z_j$ and $Z_{j'}$, $j,j'\in\{1,\ldots,p\}$ (while $\boldsymbol{\Sigma}_R$ generates the covariance between $\boldsymbol{z}_i$ and $\boldsymbol{z}_{i'}$, $i,i'\in\{1,\ldots,n\}$). In other words, $\boldsymbol{\Sigma}_R$ generates the correlation between rows of the standardised data set $\boldsymbol{D}_S$. Similarly, $\boldsymbol{\Sigma}_C$ generates the correlation between columns of $\boldsymbol{D}_S$.
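As a numerical sanity check on the matrix-normal likelihood, the following Python sketch (ours, numpy-based, with illustrative symbol names) evaluates the zero-mean matrix-normal log-density; it can be verified against the equivalent multivariate normal with Kronecker-product covariance, since $\mathrm{vec}(\boldsymbol{X})\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}_C\otimes\boldsymbol{\Sigma}_R)$ for column-stacked $\mathrm{vec}(\cdot)$:

```python
import numpy as np

def matrix_normal_logpdf(X, Sigma_R, Sigma_C):
    """Log-density of the zero-mean matrix-normal MN(0, Sigma_R, Sigma_C),
    evaluated at an n-by-p matrix X (in the spirit of Equation 2.1):
    -(1/2) * ( n*p*log(2*pi) + p*log|Sigma_R| + n*log|Sigma_C|
               + tr[Sigma_C^{-1} X^T Sigma_R^{-1} X] )."""
    n, p = X.shape
    # Use linear solves rather than explicit inverses, for numerical stability.
    quad = np.trace(np.linalg.solve(Sigma_C, X.T) @ np.linalg.solve(Sigma_R, X))
    _, logdet_R = np.linalg.slogdet(Sigma_R)
    _, logdet_C = np.linalg.slogdet(Sigma_C)
    return -0.5 * (n * p * np.log(2.0 * np.pi)
                   + p * logdet_R + n * logdet_C + quad)
```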
Theorem 2.1.
The joint posterior probability density of the correlation matrices $\boldsymbol{\Sigma}_R$ and $\boldsymbol{\Sigma}_C$, given the standardised data $\boldsymbol{D}_S$, is
$$\pi(\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\propto \ell(\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\,\pi_0(\boldsymbol{\Sigma}_R)\,\pi_0(\boldsymbol{\Sigma}_C),$$
where $\ell(\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)$ is the likelihood of $\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C$ given data $\boldsymbol{D}_S$. This can be marginalised over the $n\times n$-dimensional between-rows correlation $\boldsymbol{\Sigma}_R$, to yield
$$\pi(\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)=\frac{1}{\mathcal{I}(\boldsymbol{\Sigma}_C)}\,|\boldsymbol{\Sigma}_C|^{-n/2}\,\left|\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\right|^{-(p+2\nu-n-1)/2},$$
where the prior on $\boldsymbol{\Sigma}_C$ is uniform; the prior on $\boldsymbol{\Sigma}_R$ is the non-informative $\pi_0(\boldsymbol{\Sigma}_R)\propto|\boldsymbol{\Sigma}_R|^{-\nu}$, $\nu>0$; and $\boldsymbol{D}_S\boldsymbol{D}_S^{T}$ is assumed invertible. Here, $\mathcal{I}(\boldsymbol{\Sigma}_C)$ is a function of $\boldsymbol{\Sigma}_C$ that normalises the likelihood.
Proof.
The joint posterior probability density of $\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C$, given data $\boldsymbol{D}_S$, is
$$\pi(\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\propto \ell(\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\,|\boldsymbol{\Sigma}_R|^{-\nu},\qquad(2.2)$$
using the likelihood from Equation 2.1; using the prior on $\boldsymbol{\Sigma}_R$ to be $\pi_0(\boldsymbol{\Sigma}_R)\propto|\boldsymbol{\Sigma}_R|^{-\nu}$, where $\nu>0$; and using the prior on $\boldsymbol{\Sigma}_C$ to be uniform.
Marginalising $\boldsymbol{\Sigma}_R$ out from the joint posterior $\pi(\boldsymbol{\Sigma}_R,\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)$, we get:
$$\pi(\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\propto |\boldsymbol{\Sigma}_C|^{-n/2}\int |\boldsymbol{\Sigma}_R|^{-p/2-\nu}\exp\left(-\frac{1}{2}\,\mathrm{tr}\left[\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\boldsymbol{\Sigma}_R^{-1}\boldsymbol{D}_S\right]\right)\,d\boldsymbol{\Sigma}_R.\qquad(2.3)$$
Here, let $\boldsymbol{S}:=\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$. Now, let $\boldsymbol{A}:=\boldsymbol{\Sigma}_R^{-1}$. Then $d\boldsymbol{\Sigma}_R=|\boldsymbol{A}|^{-(n+1)}\,d\boldsymbol{A}$ (Mathai and Pederzoli, 1997), and $\mathrm{tr}\left[\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\boldsymbol{A}\boldsymbol{D}_S\right]=\mathrm{tr}\left[\boldsymbol{A}\boldsymbol{S}\right]$ (using commutativeness of the trace), so that in Equation 2.3, we get
$$\pi(\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\propto |\boldsymbol{\Sigma}_C|^{-n/2}\int |\boldsymbol{A}|^{p/2+\nu-(n+1)}\exp\left(-\frac{1}{2}\,\mathrm{tr}\left[\boldsymbol{A}\boldsymbol{S}\right]\right)\,d\boldsymbol{A}.\qquad(2.4)$$
The integral in the RHS of Equation 2.4 represents the unnormalised Wishart density $\mathcal{W}(\boldsymbol{V},\nu_0)$, over all values of the random matrix $\boldsymbol{A}$, where the scale matrix and degrees of freedom of this $\mathcal{W}(\boldsymbol{V},\nu_0)$ are $\boldsymbol{V}=\boldsymbol{S}^{-1}$ and $\nu_0=p+2\nu-n-1$ respectively, i.e. $\boldsymbol{A}\sim\mathcal{W}(\boldsymbol{S}^{-1},\,p+2\nu-n-1)$. Thus, the integral in the RHS of Equation 2.4 is the integral of the unnormalised $\mathcal{W}(\boldsymbol{V},\nu_0)$ density of $\boldsymbol{A}$, over the full support of $\boldsymbol{A}$, i.e. it equals the normalisation of this $\mathcal{W}(\boldsymbol{V},\nu_0)$: the integral on the RHS of Equation 2.4 is proportional to $|\boldsymbol{V}|^{\nu_0/2}=|\boldsymbol{S}|^{-\nu_0/2}$, so that
$$\pi(\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\propto |\boldsymbol{\Sigma}_C|^{-n/2}\,\left|\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\right|^{-(p+2\nu-n-1)/2}.\qquad(2.5)$$
Now, if $\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$ is invertible, $\left|\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\right|>0$, so that the RHS of Equation 2.5 is well defined.

It is given that $\boldsymbol{D}_S\boldsymbol{D}_S^{T}$ is invertible, i.e. $(\boldsymbol{D}_S\boldsymbol{D}_S^{T})^{-1}$ exists. The original dataset is examined to discard rows that are linear transformations of each other, leading to the data matrix $\boldsymbol{D}_S$, no two rows of which are linear transformations of each other $\Longrightarrow \boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$ is positive definite, i.e. $\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$ is invertible, for every positive definite $\boldsymbol{\Sigma}_C$.
Using this in Equation 2.5:
$$\pi(\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)\propto |\boldsymbol{\Sigma}_C|^{-n/2}\,\left|\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\right|^{-(p+2\nu-n-1)/2}.\qquad(2.6)$$
This posterior of the between-columns correlation matrix $\boldsymbol{\Sigma}_C$ given data $\boldsymbol{D}_S$ is normalised over all possible datasets, where the possible datasets abide by a column-correlation matrix of $\boldsymbol{\Sigma}_C$, as:
$$\mathcal{I}(\boldsymbol{\Sigma}_C)=\int |\boldsymbol{\Sigma}_C|^{-m/2}\,\left|\tilde{\boldsymbol{D}}\boldsymbol{\Sigma}_C^{-1}\tilde{\boldsymbol{D}}^{T}\right|^{-(p+2\nu-m-1)/2}\,d\tilde{\boldsymbol{D}},\qquad(2.7)$$
where $\tilde{\boldsymbol{D}}$ is a dataset with $m$ rows and $p$ columns, comprising values of the random standardised variables $Z_1,\ldots,Z_p$, simulated to bear a between-column correlation matrix of $\boldsymbol{\Sigma}_C$, s.t. $\tilde{\boldsymbol{D}}\boldsymbol{\Sigma}_C^{-1}\tilde{\boldsymbol{D}}^{T}$ is positive definite. We choose the same number of rows for all choices of the random data matrix $\tilde{\boldsymbol{D}}$, i.e. $m$ is a constant. Then $\mathcal{I}(\boldsymbol{\Sigma}_C)$ is a positive function of $\boldsymbol{\Sigma}_C$.
The posterior as given by Theorem 2.1 suggests that we have to compute the normalisation $\mathcal{I}(\boldsymbol{\Sigma}_C)$ in every iteration, i.e. for every updated value of $\boldsymbol{\Sigma}_C$; the normalised posterior is
$$\pi(\boldsymbol{\Sigma}_C\mid\boldsymbol{D}_S)=\frac{1}{\mathcal{I}(\boldsymbol{\Sigma}_C)}\,|\boldsymbol{\Sigma}_C|^{-n/2}\,\left|\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}\right|^{-(p+2\nu-n-1)/2}.\qquad(2.8)$$
We recall that this normalisation is given by Equation 2.7, where the standardised data sets $\tilde{\boldsymbol{D}}$ are simulated s.t. the column correlation between $Z_j$ and $Z_k$ is the $(j,k)$-th element of $\boldsymbol{\Sigma}_C$, $j,k\in\{1,\ldots,p\}$. However, it is hard to compute $\mathcal{I}(\boldsymbol{\Sigma}_C)$, defined in Equation 2.7, as a closed-form integral. Instead, we define an estimator $\hat{\mathcal{I}}^{(t)}$ of the integral representing the all-data-averaged normalisation, in the $t$-th iteration of the $N$-iteration-long MCMC chain that we run ($t=1,\ldots,N$). So, we estimate the integral in the RHS of Equation 2.7 using its unbiased estimator: we rephrase the integrand as an expectation over the distribution of the random data matrix $\tilde{\boldsymbol{D}}$, so that the estimator of the normalisation in Equation 2.7 is a sample average over realisations of $\tilde{\boldsymbol{D}}$. However, it is difficult to implement this estimator by sequentially computing expectations w.r.t. the distribution of each of the $mp$ elements of $\tilde{\boldsymbol{D}}$. Instead, we compute the expectation w.r.t. the block of these elements, where $\tilde{\boldsymbol{D}}$ abides by a column correlation of $\boldsymbol{\Sigma}_C$. During the $t$-th iteration of the MCMC chain, let the column correlation matrix be $\boldsymbol{\Sigma}_C^{(t)}$ and the normalisation be $\mathcal{I}(\boldsymbol{\Sigma}_C^{(t)})$. Then the estimator $\hat{\mathcal{I}}^{(t)}$ is defined using $\boldsymbol{\Sigma}_C^{(t)}$ and a sample of $M$ $m\times p$-dimensional data sets $\tilde{\boldsymbol{D}}_1^{(t)},\ldots,\tilde{\boldsymbol{D}}_M^{(t)}$, s.t. $\tilde{\boldsymbol{D}}_l^{(t)}(\boldsymbol{\Sigma}_C^{(t)})^{-1}\tilde{\boldsymbol{D}}_l^{(t)T}$ is positive definite, at each $l=1,\ldots,M$. Then:
$$\hat{\mathcal{I}}^{(t)}=\frac{1}{M}\sum_{l=1}^{M}\,|\boldsymbol{\Sigma}_C^{(t)}|^{-m/2}\,\left|\tilde{\boldsymbol{D}}_l^{(t)}(\boldsymbol{\Sigma}_C^{(t)})^{-1}\tilde{\boldsymbol{D}}_l^{(t)T}\right|^{-(p+2\nu-m-1)/2}.\qquad(2.9)$$
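The Monte Carlo estimator of the normalisation may be sketched as follows. This is an illustration under our own assumptions (not the paper's implementation): rows of each simulated data set are drawn as zero-mean Gaussian vectors with covariance $\boldsymbol{\Sigma}_C$, the sample sizes $m$, $M$ and the prior exponent $\nu$ are free inputs, and determinants are handled in log space.

```python
import numpy as np

def normalisation_estimator(Sigma_C, m, M, nu, rng):
    """Monte Carlo estimate of the normalisation, in the spirit of
    Equation 2.9: average, over M simulated m-by-p data sets bearing column
    correlation Sigma_C, of the integrand of Equation 2.7.
    Assumption (ours): rows of each simulated data set are drawn as
    zero-mean Gaussians with covariance Sigma_C; requires m <= p so that
    D Sigma_C^{-1} D^T is positive definite almost surely."""
    p = Sigma_C.shape[0]
    L = np.linalg.cholesky(Sigma_C)               # Sigma_C = L L^T
    _, logdet_C = np.linalg.slogdet(Sigma_C)
    expo = -(p + 2.0 * nu - m - 1.0) / 2.0
    log_vals = []
    for _ in range(M):
        D = rng.standard_normal((m, p)) @ L.T     # rows ~ N(0, Sigma_C)
        S = D @ np.linalg.solve(Sigma_C, D.T)     # D Sigma_C^{-1} D^T
        _, logdet_S = np.linalg.slogdet(S)
        log_vals.append(-(m / 2.0) * logdet_C + expo * logdet_S)
    return float(np.mean(np.exp(log_vals)))       # the estimator I-hat
```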
2.1 Learning the graphical model
We perform Bayesian learning of the inhomogeneous, Generalised Binomial random graph, given the learnt $p\times p$-dimensional, between-columns correlation matrix $\boldsymbol{\Sigma}_C$ of the standardised data set $\boldsymbol{D}_S$. Thus, the graph $\mathcal{G}(V,\boldsymbol{E})$ is a random variable that gets updated in every iteration of our Bayesian inference scheme (Metropolis-within-Gibbs), at the updated (partial) correlation matrix given the data. Here, the graph $\mathcal{G}(V,\boldsymbol{E})$ has the vertex set $V=\{v_1,\ldots,v_p\}$, and the between-columns partial correlation matrix of the data is $\boldsymbol{R}=(\rho_{jk})$, s.t. $\rho_{jk}$ takes the value $r_{jk}$, $j\neq k$; $j,k\in\{1,\ldots,p\}$; and $\rho_{jj}=1$. The vertex set is s.t. vertices $v_j,v_k$ are joined by the edge $G_{jk}$, which is a random binary variable taking values $g_{jk}\in\{0,1\}$, where $g_{jk}$ is the $(j,k)$-th element of the edge matrix $\boldsymbol{E}$. Given a learnt value of the between-columns correlation matrix $\boldsymbol{\Sigma}_C$, to compute the value of the partial correlation variable $\rho_{jk}$, we first invert $\boldsymbol{\Sigma}_C$ to yield $\boldsymbol{\Psi}:=\boldsymbol{\Sigma}_C^{-1}=(\psi_{jk})$, s.t.
$$\rho_{jk}=-\frac{\psi_{jk}}{\sqrt{\psi_{jj}\,\psi_{kk}}}\qquad(2.10)$$
for $j\neq k$, and $\rho_{jj}=1$ for $j=k$.
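The map from a correlation matrix to the partial correlation matrix translates directly into code. A minimal Python sketch (ours, assuming numpy; symbol names are illustrative):

```python
import numpy as np

def partial_correlation(Sigma_C):
    """Partial correlation matrix from a (between-columns) correlation
    matrix, via Equation 2.10: rho_jk = -psi_jk / sqrt(psi_jj * psi_kk),
    where Psi = Sigma_C^{-1}; diagonal set to 1 by convention."""
    Psi = np.linalg.inv(Sigma_C)
    d = np.sqrt(np.diag(Psi))
    R = -Psi / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R
```

For a 3-variable equicorrelated correlation matrix with off-diagonal 0.5, each off-diagonal partial correlation is $1/3$, which the function reproduces.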
The posterior probability density of the graph $\mathcal{G}(V,\boldsymbol{E})$, defined for the edge matrix $\boldsymbol{E}$, is given as
$$\pi(\boldsymbol{E}\mid\boldsymbol{R})\propto \ell(\boldsymbol{E}\mid\boldsymbol{R})\,\pi_0(\boldsymbol{E}),$$
where $\pi_0(\boldsymbol{E})$ is the prior probability density on the edge parameters $G_{jk}$, $1\le j<k\le p$, and $\ell(\boldsymbol{E}\mid\boldsymbol{R})$ is the likelihood of the edge parameters, given the partial correlation matrix $\boldsymbol{R}$ (that is itself computed using the between-columns correlation matrix $\boldsymbol{\Sigma}_C$, learnt given $\boldsymbol{D}_S$; see Equation 2.8). We choose a prior on $G_{jk}$ that is Bernoulli with rate parameter 0.5, i.e. $G_{jk}\sim \mathrm{Bernoulli}(0.5)$; thus, the prior is independent of the edge parameters. We choose to define the likelihood of any edge parameter, given the corresponding partial correlation, to be a Folded Normal density (Leone et al., 1961). Thus, the likelihood of the edge parameters given $\boldsymbol{R}$ is
$$\ell(\boldsymbol{E}\mid\boldsymbol{R})=\prod_{j<k}\frac{1}{\sqrt{2\pi\sigma_{jk}^2}}\left[\exp\left(-\frac{(g_{jk}-|r_{jk}|)^2}{2\sigma_{jk}^2}\right)+\exp\left(-\frac{(g_{jk}+|r_{jk}|)^2}{2\sigma_{jk}^2}\right)\right],\qquad(2.11)$$
where the variance parameters $\sigma_{jk}^2$ are indeed hyperparameters that are also learnt from the data; these variance parameters have uniform prior probabilities imposed on them.
In light of the fact that the parameters that are learnt are the edge parameters as well as the variance parameters, we rephrase the likelihood as $\ell(\{G_{jk}\},\{\sigma_{jk}^2\}\mid\boldsymbol{R})$.
2.2 Inference using Metropolis-within-Gibbs
Equation 2.8 gives the posterior probability density of the correlation matrix $\boldsymbol{\Sigma}_C$, given data $\boldsymbol{D}_S$. In our Metropolis-within-Gibbs-based inference, we update $\boldsymbol{\Sigma}_C$, at which the partial correlation matrix $\boldsymbol{R}$ is computed. Given this updated $\boldsymbol{R}$, we then update the graph $\mathcal{G}(V,\boldsymbol{E})$. The graphical model comprising the credible-region-defining set of random Binomial graphs is thus learnt, where the vertex set of each graph in this set is fixed as $V$; the “credible region” in question is defined below in Section 2.3.
To be precise, by updating of the symmetric correlation matrix $\boldsymbol{\Sigma}_C$ we imply updating its non-diagonal elements of the upper (or lower) triangle, i.e. the parameters $c_{jk}$, $1\le j<k\le p$. Given the nature of the likelihood of these parameters (see Equation 2.8), this updating involves the inversion of $\boldsymbol{\Sigma}_C$, and the computing of the determinants of $\boldsymbol{\Sigma}_C$ and $\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$, in every iteration. This is possible with the computation of the square root of the column correlation matrix $\boldsymbol{\Sigma}_C$, as well as the factorisation of $\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$ into two triangular matrices. Thus, in any iteration, we can compute the $p\times p$-dimensional (lower, by choice) triangular matrix $\boldsymbol{L}$ that is the square root of $\boldsymbol{\Sigma}_C$, i.e. $\boldsymbol{\Sigma}_C=\boldsymbol{L}\boldsymbol{L}^{T}$. Here $\boldsymbol{L}$ is computed via Cholesky decomposition. Since the factors of $\boldsymbol{\Sigma}_C$ are already computed (Cholesky decomposed) as $\boldsymbol{L}$ and $\boldsymbol{L}^{T}$, the inversion of $\boldsymbol{\Sigma}_C$ is undertaken using a forward substitution algorithm, i.e. we set $\boldsymbol{\Sigma}_C^{-1}=(\boldsymbol{L}^{-1})^{T}\boldsymbol{L}^{-1}$, where $\boldsymbol{L}^{-1}$ is computed using $\boldsymbol{L}$ in a forward substitution scheme. Then, following this inversion, $\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$ is itself Cholesky decomposed into a (lower) triangular matrix and its transpose, to allow for its determinant to be computed. The underlying schemes for forward substitution and Cholesky decomposition are discussed in Section 7 of the Supplementary Material.
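The inversion route described above (Cholesky square root, then forward substitution on the triangular factor) can be sketched as follows. This is an illustrative Python fragment of ours, not the authors' implementation:

```python
import numpy as np

def forward_substitution_inverse(L):
    """Invert a lower-triangular matrix L by forward substitution,
    solving L X = I column by column (X is again lower-triangular)."""
    p = L.shape[0]
    X = np.zeros((p, p))
    for j in range(p):
        for i in range(j, p):
            s = (1.0 if i == j else 0.0) - L[i, :i] @ X[:i, j]
            X[i, j] = s / L[i, i]
    return X

def correlation_inverse(Sigma_C):
    """Sigma_C^{-1} via Cholesky (Sigma_C = L L^T) followed by forward
    substitution: (L L^T)^{-1} = L^{-T} L^{-1}."""
    L = np.linalg.cholesky(Sigma_C)
    Linv = forward_substitution_inverse(L)
    return Linv.T @ Linv
```

The determinant falls out of the same factorisation: $|\boldsymbol{\Sigma}_C|$ is the square of the product of the diagonal elements of $\boldsymbol{L}$.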
As mentioned above, we assume that at every iteration, the proposed $\boldsymbol{\Sigma}_C$ and $\boldsymbol{D}_S\boldsymbol{\Sigma}_C^{-1}\boldsymbol{D}_S^{T}$ are positive definite; to implement this, we define the data $\boldsymbol{D}_S$ to be composed of those rows of the original data set no two of which are linear transformations of each other. It is still possible that during an iteration, a proposed $\boldsymbol{\Sigma}_C$ is s.t. its square root is not positive definite, within some preset numerical threshold. One possible solution to this problem is numerical, for example, implementing ridge adjustment (Wothke, 1993), i.e. adding a “small” number to the diagonal elements of the matrix, where “small” is typically a tiny multiple of a diagonal element of the matrix in our implementation.
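The ridge adjustment can be sketched as below. This is an illustrative Python fragment; the starting scale of the added constant and the escalation policy are our assumptions, not the paper's:

```python
import numpy as np

def ridge_adjust(S, eps_scale=1e-8):
    """Ridge adjustment in the sense of Wothke (1993): add a small multiple
    of the mean diagonal element to the diagonal until the Cholesky
    factorisation succeeds. The starting scale (eps_scale) and the 10-fold
    escalation are illustrative choices of ours."""
    eps = eps_scale * float(np.mean(np.diag(S)))
    A = S.copy()
    while True:
        try:
            np.linalg.cholesky(A)     # succeeds iff A is positive definite
            return A
        except np.linalg.LinAlgError:
            A = A + eps * np.eye(S.shape[0])
            eps *= 10.0
```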
As we saw in Section 2, the joint posterior probability density of the elements of the upper/lower triangle of the correlation matrix $\boldsymbol{\Sigma}_C$, given data $\boldsymbol{D}_S$, is normalised, with a normalisation factor that needs updating at each update of $\boldsymbol{\Sigma}_C$. This normalisation factor is defined in Equation 2.7. An estimator of this normalisation in the $t$-th iteration is given in Equation 2.9, as the arithmetic mean of $M$ random realisations of the integrand in the integral representing the normalisation. Each such realisation results from a sampled data set $\tilde{\boldsymbol{D}}_l^{(t)}$ that bears the column correlation matrix $\boldsymbol{\Sigma}_C^{(t)}$ updated in this $t$-th iteration of the MCMC chain; $l=1,\ldots,M$; $t=1,\ldots,N$. Here, each of the sampled data sets is generated with $m$ rows (and $p$ columns, as $\boldsymbol{D}_S$ is $n\times p$-dimensional). To generate $\tilde{\boldsymbol{D}}_l^{(t)}$ as a randomly sampled $m\times p$-sized data set with column correlation $\boldsymbol{\Sigma}_C^{(t)}$, we use established sampling techniques for sampling data given a correlation structure.
The algorithm followed for our MetropoliswithinGibbsbased inference is the following.

At the 0th iteration, we choose a seed value $\boldsymbol{\Sigma}_C^{(0)}$ of the correlation matrix, i.e. choose seed values of the parameters $c_{jk}$, $1\le j<k\le p$. At this iteration, as at every iteration, Cholesky decomposition of the column correlation matrix is undertaken, i.e. in this iteration, $\boldsymbol{\Sigma}_C^{(0)}=\boldsymbol{L}^{(0)}\boldsymbol{L}^{(0)T}$. Then $|\boldsymbol{\Sigma}_C^{(0)}|$ is computed as the square of the product of the diagonal elements of $\boldsymbol{L}^{(0)}$. Using $\boldsymbol{L}^{(0)}$, its inverse is computed (using forward substitution). Then $(\boldsymbol{\Sigma}_C^{(0)})^{-1}=(\boldsymbol{L}^{(0)-1})^{T}\boldsymbol{L}^{(0)-1}$. Next, $\boldsymbol{D}_S(\boldsymbol{\Sigma}_C^{(0)})^{-1}\boldsymbol{D}_S^{T}$ is Cholesky decomposed, to allow for its determinant to be computed. Using $(\boldsymbol{\Sigma}_C^{(0)})^{-1}$, the partial correlation matrix $\boldsymbol{R}^{(0)}$ in this iteration is computed, i.e. its $(j,k)$-th element is computed via Equation 2.10. For this seed correlation $\boldsymbol{\Sigma}_C^{(0)}$, the estimator $\hat{\mathcal{I}}^{(0)}$ of the normalisation factor is computed using Equation 2.9, following the random selection of $M$ $m\times p$-dimensional data sets with column correlation $\boldsymbol{\Sigma}_C^{(0)}$ (the chosen value of $M$ is held fixed across iterations). We compute the joint posterior of the non-diagonal, upper-triangle elements of $\boldsymbol{\Sigma}_C^{(0)}$, given data $\boldsymbol{D}_S$, using Equation 2.8.

In the 0th iteration, seed values $g_{jk}^{(0)}$ of the edge parameters $G_{jk}$ are assigned, $1\le j<k\le p$. In addition, seed values $\sigma_{jk}^{(0)}$ are assigned to the variance parameters; $1\le j<k\le p$. The joint posterior probability density of the edge and the variance parameters is given as the product of the likelihood in Equation 2.11, the uniform prior on each variance parameter, and the Bernoulli prior with rate parameter 0.5 on each edge parameter.

As the $t$-th iteration begins ($t=1,\ldots,N$), we propose a value $c_{jk}^{\star}$ for each parameter $c_{jk}$, $1\le j<k\le p$, from a Truncated Normal density that is left truncated at $-1$ and right truncated at $1$, i.e. $c_{jk}^{\star}\sim\mathcal{TN}(c_{jk}^{(t-1)},s^2;-1,1)$, where $s^2$ is the experimentally chosen variance, and the proposal mean is the current value $c_{jk}^{(t-1)}$ at the end of the $(t-1)$-th iteration. Let the proposed value of the correlation matrix in the $t$-th iteration be called $\boldsymbol{\Sigma}_C^{\star}$, and the current value $\boldsymbol{\Sigma}_C^{(t-1)}$. We refer to the estimator of the normalisation at the proposed $\boldsymbol{\Sigma}_C^{\star}$ as $\hat{\mathcal{I}}^{\star}$, and compute it using the generated sample in Equation 2.9. The current normalisation $\hat{\mathcal{I}}^{(t-1)}$ at the current correlation matrix is computed using the generated sample in Equation 2.9. Then we accept the proposed $\boldsymbol{\Sigma}_C^{\star}$ at a probability of
$$\alpha=\min\left\{1,\ \frac{\pi^{\star}(\boldsymbol{\Sigma}_C^{\star}\mid\boldsymbol{D}_S)\,/\,\hat{\mathcal{I}}^{\star}}{\pi^{\star}(\boldsymbol{\Sigma}_C^{(t-1)}\mid\boldsymbol{D}_S)\,/\,\hat{\mathcal{I}}^{(t-1)}}\right\},$$
where $\pi^{\star}(\cdot\mid\boldsymbol{D}_S)$ is the unnormalised posterior given in Equation 2.6. If $u\le\alpha$, where $u\sim\mathcal{U}(0,1)$, we update the correlation matrix to $\boldsymbol{\Sigma}_C^{(t)}=\boldsymbol{\Sigma}_C^{\star}$; else $\boldsymbol{\Sigma}_C^{(t)}=\boldsymbol{\Sigma}_C^{(t-1)}$. We invert the current correlation $\boldsymbol{\Sigma}_C^{(t)}$ to $(\boldsymbol{\Sigma}_C^{(t)})^{-1}$, to compute the current value $\boldsymbol{R}^{(t)}$ of the partial correlation matrix (see Equation 2.10).
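The truncated-normal proposal of this step can be sketched by simple rejection sampling. This is an illustrative Python fragment of ours; the paper does not prescribe this particular sampler, and any truncated-normal sampler would do:

```python
import numpy as np

def propose_truncated_normal(mean, sd, rng, low=-1.0, high=1.0):
    """Draw a proposal from a Normal(mean, sd^2) truncated to (low, high),
    here (-1, 1) for an off-diagonal correlation element, by rejection
    sampling: redraw until the draw lands inside the interval."""
    while True:
        x = rng.normal(mean, sd)
        if low < x < high:
            return x
```

For means well inside $(-1, 1)$ and moderate proposal variance, the rejection rate is small; near the boundaries a dedicated truncated-normal sampler is more efficient.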

At the 2nd block of the $t$-th iteration, the graph variable is updated, given the current partial correlation matrix $\boldsymbol{R}^{(t)}$. To be precise, we propose $g_{jk}^{\star}\sim\mathrm{Bernoulli}(0.5)$, $1\le j<k\le p$. Also, for each $(j,k)$, we propose $\sigma_{jk}^{\star}$ from a Normal proposal density, where the variance is experimentally chosen and the mean is the current value $\sigma_{jk}^{(t-1)}$. Then we accept the proposed $g_{jk}^{\star}$ and $\sigma_{jk}^{\star}$, $1\le j<k\le p$, at the probability of
$$\alpha_2=\min\left\{1,\ \frac{\ell(\{g_{jk}^{\star}\},\{\sigma_{jk}^{\star}\}\mid\boldsymbol{R}^{(t)})}{\ell(\{g_{jk}^{(t-1)}\},\{\sigma_{jk}^{(t-1)}\}\mid\boldsymbol{R}^{(t)})}\right\},$$
where $\ell(\cdot\mid\boldsymbol{R}^{(t)})$ is the Folded Normal likelihood of Equation 2.11. If the acceptance probability $\alpha_2\ge u_2$, where $u_2\sim\mathcal{U}(0,1)$, we accept the proposed values of the edge and variance parameters, i.e. set $g_{jk}^{(t)}=g_{jk}^{\star}$ and $\sigma_{jk}^{(t)}=\sigma_{jk}^{\star}$, $1\le j<k\le p$. Otherwise, the proposed values of the parameters are rejected and we set the current values in iteration $t$ to be those of the previous iteration. The graph at the end of the $t$-th iteration is $\mathcal{G}(V,\boldsymbol{E}^{(t)})$, where $\boldsymbol{R}^{(t)}$ is the updated partial correlation matrix in this iteration.

Stop if $t=N$; else repeat the above two updating blocks for the next iteration.
2.3 Definition of 95% HPD credible regions on the random graph variable and the learnt graphical model
We perform Bayesian inference on the random graph variable $\mathcal{G}(V,\boldsymbol{E})$, leading to one sampled graph at the end of each of the $N$ iterations of our inference scheme (Metropolis-within-Gibbs). In order to acknowledge uncertainties in the Bayesian learning of a single representation of this graphical model, we need to include in its definition only those graphs, sampled post-burn-in, that lie within an identified 95% HPD credible region. We define $\phi_{jk}$ as the fraction of the post-burn-in number $N_{\mathrm{post}}$ of iterations (where $N_{\mathrm{post}}<N$) in which the $(j,k)$-th edge exists, i.e. $G_{jk}$ takes the value 1, $1\le j<k\le p$. Thus, we define the variable $\phi_{jk}$ that takes the value
$$\phi_{jk}=\frac{1}{N_{\mathrm{post}}}\sum_{t=N-N_{\mathrm{post}}+1}^{N} g_{jk}^{(t)}.\qquad(2.12)$$
Then $\phi_{jk}$ is the fractional number of sampled graphs in which an edge exists between vertices $v_j$ and $v_k$. This leads us to interpret $\phi_{jk}$ as carrying information about the uncertainty in the graph learnt given data $\boldsymbol{D}_S$; in particular, $\phi_{jk}$ approximates the probability of existence of the edge between the $j$-th and $k$-th nodes in the graphical model of the data at hand. Indeed, the parameters $\phi_{jk}$ are functions of the partial correlation matrix that is learnt given this data, but for the sake of notational brevity, we do not include this explicit dependence in our notation for the edge probability parameters.
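The edge-probability estimate described above amounts to averaging the sampled binary edge matrices over post-burn-in iterations. A minimal Python sketch (ours; the array layout is an illustrative assumption):

```python
import numpy as np

def edge_probabilities(edge_samples, burn_in):
    """Approximate edge probabilities in the sense of Equation 2.12: the
    fraction of post-burn-in sampled graphs in which each edge exists.
    edge_samples: array of shape (N, p, p) of binary edge matrices."""
    post = edge_samples[burn_in:]
    return post.mean(axis=0)   # entries lie in [0, 1]
```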
So we view the set