Graphical models describe intrinsic relationships among a collection of variables. Each variable in the collection is represented by a node or a vertex. Two nodes in the graph are connected by an edge if and only if the corresponding variables are not conditionally independent given the remaining variables. Conditional independence impacts the precision matrix, that is, the inverse covariance matrix, by setting the
th off-diagonal entry to zero if the random variables associated with theth and th nodes are conditionally independent given others. Conditional independence makes the partial correlation coefficient between the random variables associated with the th and
th entries equal to zero as well. If the random variables in the collection can be assumed to be jointly normally distributed, then the conditional independence between theth and the th variables is exactly equivalent to having the
th entry of the precision matrix equal to zero. Such models are known as Gaussian Graphical Models (GGMs). Learning the conditional dependence structure in a GGM is therefore equivalent to estimating the corresponding precision matrix under the assumed sparsity condition. Modeling intrinsic dependence between random variables through GGMs is commonly used in biology, finance, and the social sciences.
Estimation of a sparse precision matrix needs some form of regularization. In the non-Bayesian literature, the estimation is typically carried out by minimizing the penalized log-likelihood of the data with the -penalty on the elements of the precision matrix. This method is known as the graphical lasso (Friedman et al., 2008). Many algorithms have been proposed to solve this problem (Meinshausen and Buhlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008; Banerjee et al., 2008; d’Aspremont et al., 2008; Rothman et al., 2008; Lu, 2009; Scheinberg et al., 2010; Witten et al., 2011; Mazumder and Hastie, 2012).
Bayesian methods for GGMs involve using priors on the precision matrix and priors on the graph as well. A popular prior on a precision matrix is given by the family of G-Wishart priors (Giudici, 1999; Letac and Massam, 2007; Wang and Li, 2012). The G-Wishart prior is conjugate to multivariate normal random variables and yields an explicit expression for the posterior mean. If the underlying graph is decomposable, the normalizing constant in a G-Wishart distribution has a simple closed form expression. In the absence of decomposability, the expression is more complex (Uhler et al., 2018), but may be computed by simulations. Simulations from a G-Wishart distribution is possible using the R package BDgraph (Mohammadi and Wit, 2017, 2019), which uses an explicit expression for the normalizing constant for a decomposable graph and uses the birth-death MCMC algorithm (Mohammadi and Wit, 2015)
if the graph is not decomposable. This allows computation of the marginal likelihood, and hence the posterior probability, of any given graph. However, as the number of possible graphs is huge, computing posterior probabilities of all graphs is an impossible task for even a modest number of nodes. The problem is worsened by the fact that a very low fraction of graphs are decomposable. Thus when learning the graphical structure from the data, alternative mechanisms of putting priors on the entries of the precision matrix that allow sparsity are typically employed. A prior that models a sparse precision matrix is ideally a mixture of a point mass at zero and a continuous component(Wong et al., 2003; Carter et al., 2011; Talluri et al., 2014; Banerjee and Ghosal, 2015). However, since the normalizing constants in these mixture priors are intractable due to the positive definiteness constraint on the precision matrix, absolutely continuous priors have been proposed. The Bayesian graphical lasso (Wang, 2012) has been developed as a Bayesian counterpart to the graphical lasso. However, its use of a double exponential prior, which does not have enough mass at zero, does not give a true Bayesian model for sparsity. Continuous shrinkage priors, such as the horseshoe (Carvalho et al., 2010), generalized double Pareto (Armagan et al., 2013), Dirichet-Laplace (Bhattacharya et al., 2015), and others have been proposed as better models of sparsity since these priors have infinite spikes at zero and heavy tails.
Only a few results on the frequentist behavior of Bayesian methods for precision matrix estimation exist in the literature. Banerjee and Ghosal (2014) studied posterior convergence rates for a G-Wishart prior inducing a banding structure, but the true precision matrix need not have a banded structure. Banerjee and Ghosal (2015) provided results on posterior contraction rates for the precision matrix under point mass spike-and-slab priors.
Although GGMs are useful, the distributional assumption may fail to hold on some occasions. A nonparametric extension of the normal distribution is the nonparanormal distribution in which the random variables are replaced by some transformed random variables and it is assumed that has a -variate normal distribution ) (Liu et al., 2009). In some situations, the logarithmic transform may be appropriate, but in general, the transformations are hard to specify. It is, therefore, more sensible to let be unspecified, and use a nonparametric technique for their estimation. Liu et al. (2009) designed the nonparanormal graphical model, a two-step estimation process in which the functions were estimated first using a truncated empirical distribution function, and then the inverse covariance matrix was estimated using the graphical lasso applied to the transformed data. Although the approach in Liu et al. (2009) works well in many settings, their estimator for the transformation functions is based on the empirical distribution function, which leads to an unsmooth estimator. While the focus of this paper is on the nonparanormal graphical model, an alternative to the nonparanormal graphical model is the copula Gaussian graphical model (Pitt et al., 2006; Dobra and Lenkoski, 2011; Liu et al., 2012; Mohammadi and Wit, 2017) which avoids estimation of the transformation functions by using rank-based methods to transform the observed variables.
Bayesian approaches can naturally blend the desired smoothness in the estimate by considering a prior on a function space that consists of smooth functions. Gaussian process priors are the most commonly used priors on functions (Rasmussen and Williams, 2006; Choudhuri et al., 2007; van der Vaart and van Zanten, 2007; Lenk and Choi, 2017). Priors on function spaces have also been developed using a finite random series of certain basis functions like trigonometric polynomials, B-splines, or wavelets (Rivoirard and Rousseau, 2012; de Jonge and van Zanten, 2012; Arbel et al., 2013; Shen and Ghosal, 2015). We consider a Bayesian approach using a finite random series of B-splines prior on the underlying transformations. We choose the B-splines basis over other possible choices because B-splines can easily accommodate restrictions on functions, such as monotonicity and linear constraints, without compromising good approximation properties (Shen and Ghosal, 2015). In our context, as the transformation functions
are increasing, imposing the monotonicity restriction through the prior is essential. This can be easily installed through a finite random series of B-splines by imposing the order restriction on the coefficients. By equipping the vector of the coefficients with a multivariate normal prior truncated to the cone of ordered coordinates, the order restriction can be imposed maintaining the conjugacy inherited from the original multivariate normal distribution. A simple Gibbs sampler is constructed in which first, a truncated normal prior on the transformation functions results in a truncated normal posterior distribution that is sampled using a Hamiltonian Monte Carlo technique(Pakman and Paninski, 2014). Second, a Student t-spike-and-slab prior on the precision matrix of the transformed variables results in sampling the corresponding posterior distribution of the precision matrix and the edge matrix, which determines the absence or presence of an edge in the graphical model. The underlying graphical structure can then be constructed from the obtained edge matrix.
The paper is organized as follows. In the next section, we state model assumptions of the Gaussian graphical model and the nonparanormal graphical model. In addition, we specify the prior distributions for the underlying parameters. In Section 3, we obtain the posterior distributions, describe the Gibbs sampling algorithm and the tuning procedure. In Section 4, we provide a posterior consistency result for the priors under consideration. In Section 5, we present a simulation study. In Section 6, we apply the method to a real data set and finally, we conclude with a discussion section.
2 Model and Priors
Let denote a random vector that is distributed as -variate multivariate normal, . The undirected graph that corresponds to this distribution consists of a vertex set , which has elements for each component of , and an edge set
which consists of ordered pairswhere if there is an edge between and . The edge between is excluded from if and only if is independent of given all other variables. For multivariate normal distributions, the conditional independence holds if and only if ; here for a matrix , denotes its th element.
A random vector has a nonparanormal distribution if there exist smooth monotone functions such that , where . In this case we shall write .
By assuming that the transformed variables are distributed as normal, the conditional independence information in the nonparanormal model is completely contained in the parameter
, as in a parametric normal model. Since the transformation functions are one-to-one, the inherent dependency structure given by the graph for the observed variables is retained by the transformed variables. We note that any continuous random variable can be transformed into a normal variable by a strictly increasing transformation. However, testing for high-dimensional multivariate normality is not feasible, and hence testing for the nonparanormality assumption is not possible in high dimension, but clearly, the condition is a lot more general than multivariate normality. Instead of testing for nonparanormality, one may assess the efficacy of the assumption by looking at the effect of the transformations. If the transformation functions are linear, then assuming multivariate normality should be adequate. If the transformation functions are non-linear, then modeling through the nonparanormal distribution may be useful.
We put prior distributions on the unknown transformation functions through a random series based on B-splines. The coefficients are ordered to induce monotonicity, and the smoothness is controlled by the degree of the B-splines and the number of basis functions used in the expansion. Cubic splines, which are B-splines of degree 4, are used in this paper. The resulting posterior means of the coefficients give rise to a monotone smooth Bayes estimate of the underlying transformations.
Thus the smooth monotone functions that we use to estimate the true transformation functions are assumed to be multivariate normal,
where is a -vector of functions, is an matrix, and is a -vector; here are the B-spline basis functions, are the associated coefficients in the expansion of the function, and is the number of B-spline basis functions used in the expansion. These transformed variables are subsequently used to estimate the sparse precision matrix and hence in structure learning.
In the next part, we discuss the prior on the coefficients in more detail.
Prior on the B-spline coefficients
First, we temporarily disregard the monotonicity issue and put a normal prior on the coefficients of the B-splines, , where is some positive constant, is some vector of constants, and
is the identity matrix. A normal prior is convenient as it leads to conjugacy. However, apart from monotonicity of the transformations, we also need to address identifiability since unknownand allow flexibility in the location and the scale of the transformation so that the distribution of can be multivariate normal for many different choices of . The easiest way to address identifiability is to standardize the transformations by setting and the diagonal entries of to . However, then it will be more difficult to put a prior on sparse complying with the restriction on the diagonal entries of because of the constraint . Hence it is easier to keep and free and impose restrictions on the locations and the scales of the transformation functions , . There are different ways to impose constraints on the locations and scales of . One can impose some location and scale restrictions on the corresponding B-spline coefficients, for instance, by making the mean
and the variance. Then the prior distribution for , , will have to be conditioned on these restrictions. The non-linearity of the variance restriction makes the prior less tractable. In order to obtain a conjugate normal prior, we instead consider the following two linear constraints on the coefficients through function values of the transformations:
It may be noted that, as only a few B-spline functions are non-zero at any given point, the restrictions (2.2) and (2.3) involve only a few s. More specifically, as the degree of B-splines used in this paper is , the first equation involves only coefficients and the second only , no matter how large is.
The linear constraints can be written in matrix form as
Using conditional normal distribution theory, the resulting prior on the coefficients is
where the prior mean and variance are
However, the prior dispersion matrix is singular due to the two linear constraints, resulting in a lack of Lebesgue density for the prior distribution on . Thus, we work with a dimension reduced coefficient vector by removing two coefficients to ensure that we have a Lebesgue density on for the remaining components. Suppose we remove the last two coefficients. Then, the reduced vector of basis coefficients is . Then we can solve for and using to obtain,
where are the corresponding constants. In matrix form, we have
where and .
Then the resulting prior for the coefficients for each predictor is,
where the reduction is denoted with a bar.
Finally, we impose the monotonicity constraint on the coefficients, which is equivalent with the series of inequalities and expressed in matrix/vector form is , where is ,
Due to the two linear constraints, the monotonicity constraint reduces to
where is the matrix,
and is the constant -vector, .
The final prior on the coefficients is given by a truncated normal prior distribution
where , and the -distribution restricted on a set is denoted by . The conjugacy property of the prior distribution is preserved by the truncation. Instead of the simplifying example of solving for the last two coefficients, we use a more general method to reduce the dimension. The Symbolic Math Toolbox in MATLAB is used to solve for any two coefficients in terms of the remaining coefficients. In particular, for the first row of the linear constraints matrix given by (2.5), we find the first column with a nonzero element. Then, for the second row of the linear constraints matrix, we find the first column with a nonzero element that is not the same as the column selected from the first row. We use the indices from those two columns to select the two coefficients that will be removed from the dimension in order to find , and
Although any choice of is admissible, the prior can put a substantial probability of the truncation set only when the original mean vector has increasing components. A simple choice of involving only two hyperparameters is given by
where is a constant, is a positive constant, and
is the inverse of the cumulative distribution function (i.e. the quantile function) of the standard normal distribution. The motivation for the choice comes from imagining that the prior distribution of eachas before the ordering is imposed, and hence the expectations of the order statistics of may be considered as good choices for their means. The expression in (2.15) gives a reasonable approximation to these expectations. Similar expressions appear for the score function of locally most powerful rank tests against normal alternatives (see Hájek et al. (1999)). Royston (1982) described the expression , , as a more accurate approximation for the expected values of standard normal order statistics than the expression used in rank tests.
Prior on the mean
For each predictor, we put an improper uniform prior on .
Prior on the precision matrix
We build on the techniques of Wang (2015), which use a normal spike-and-slab prior to estimate a sparse precision matrix, but replace the normal by a Student t-distribution spike-and-slab prior, following Scheipl et al. (2012). Let be the slab variance and be the spike variance. The spike scale is assumed to be very small and given. Having a continuous spike instead of a point mass at zero is more convenient since it admits density; see Wang (2015). Unlike in Wang (2015), we estimate the sparse precision matrix by allowing the spike-and-slab variances and probability to be random with an inverse-gamma prior to lead to a Student t-distribution for the slabs. The diagonal entries of
are given an exponential distribution with rate parameterfor some
. We introduce a symmetric matrix of latent binary variableswith binary entries to represent the edge matrix. The entries , , are assumed to be independent with denoting the probability of , i.e. the probability of an edge. Let and respectively stand for the densities of the normal and exponential distributions. Let . Let stand for the space of positive definite matrices and . The joint prior for and is then obtained as
The prior for are given by, independently of each other,controls the distribution of the diagonal elements of . We use under similar reasoning to Wang (2015)
, because it assigns a considerable probability to the region of reasonable values of the diagonal elements. We set the shape parameters of the beta distribution to 1 and 10 to set the prior probability of sparsity to about 10%. SeeScheipl et al. (2012) for more details regarding the spike-and-slab prior based on a mixture of inverse gamma distributions.
3 Posterior Computation
The full posterior distribution is
where , the prior on the B-spline coefficients is , the prior on the means is , and the joint prior on the sparse precision matrix and the edge matrix is . Here, the likelihood is constructed from the working assumption that .
The joint posteriors are standard and so they are not derived. They can be evaluated in the following Gibbs sampling algorithm.
3.1 Gibbs Sampling Algorithm
For every , sample the B-spline coefficients as follows.
Since we can reduce the number of coefficients by two, the basis functions for these two coefficients can be represented as
where the is used to denote the two-dimensional vectors and .
Setting , the joint posterior for the B-spline coefficients is a truncated normal, with density
restricted on the region to satisfy the monotonicity constraint.
However, this truncated multivariate normal distribution is
dimensional, so we sample it using the following conditional normals in a Markov chain,
where using the conditional normal theory,
Samples from the truncated conditional normal posterior distributions for the B-spline coefficients are obtained using the exact Hamiltonian Monte Carlo algorithm (exact HMC) (Pakman and Paninski, 2014). Each iteration of the exact HMC results in a transition kernel which leaves the target distribution invariant and the Metropolis acceptance probability equal to 1. The exact HMC within Gibbs is like Metropolis within Gibbs and hence is a valid algorithm to sample from the joint density.
Obtain the centered transformed variables:
The posterior density of given is
For every , sample each column vector of and using the following partitions as described in Wang (2015):
Denote to be the symmetric matrix with zeros in the diagonal and in the upper diagonal entries. Similarly, denote and to be symmetric matrices with zeros in the diagonal and and in the upper diagonal entries, respectively.
Without loss of generality, partition , and by focusing on the last column and row:
To sample a column vector of , use the following change of variables:
Then the full conditionals are given by
where , and Ga stands for the gamma distribution.
To sample the corresponding off-diagonal column vector of the edge-inclusion indexes , , , since the are independent Bernoulli, we sample according to the probability
where stands for the normal density function.
Update , , , based on the off-diagonal column vectors and , using the relations
Update , , , based on the off-diagonal entry ,
These steps are repeated until convergence.
3.2 Choice of Prior Parameters
We use a model selection criterion to determine the optimal number of basis functions pre-MCMC. Sampling methods that involve putting a prior on the number of basis functions, such as reversible jump Monte Carlo, are computationally complicated. We calculate the Akaike Information Criterion (AIC) for different numbers of basis functions and choose the number of basis functions that correspond to the lowest AIC. The AIC is determined as the minimum of two times the negative log-likelihood , plus the number of parameters in the model, with respect to the basis coefficients subject to the linear and monotonicity constraints. The AIC is preferred here as the true transform does not belong to the set of splines and hence the correct model selection is not the goal, but minimizing the estimated estimation error is, which is provided by the model with the lowest AIC. The lowest AIC is found between a grid of four and 100 basis functions by doing a search in which the lowest AIC is chosen when the next ten values are larger than the current value in the search, since the AIC should approximately be a U-shaped curve due to the trade-off between accuracy and complexity. Then for each predictor, , and for the number of basis functions, ,
After plugging in the maximum likelihood estimators (MLEs) of and and making the substitution , minimizing the results in the following problem,
This problem can be equivalently solved using the quadratic programming function in MATLAB Optimization Toolbox:
For numerical stability, the monotonicity constraint was changed to . Finally, after plugging in the solution of the quadratic programming problem , the final number of basis functions is chosen by selecting the number that minimizes the AIC
There is some dependence on the choice of hyperparameters. We use a model selection criterion to determine the hyperparameters, and , for inverse gamma distributions for and to determine the constant value for the spike scale, , after the MCMC sampling. Inspired by Dahl et al. (2005, 2008), we solve a convex optimization problem in order to use the Bayesian Information Criterion (BIC). First, we find the Bayes estimate of the inverse covariance matrix, . The Bayes estimate is defined as . We find the average of the transformed variables, , where , , are obtained from the MCMC output. Then, using the sum of squares matrix, , we solve the following to obtain the maximum likelihood estimate of the inverse covariance matrix, :
where represents the elements of that are zero and nonzero, and they are determined by the zeros of the estimated edge matrix from the MCMC. The estimated edge matrix from the MCMC sampler will be described in more detail in Section 5. This constrained optimization problem was implemented as an unconstrained optimization problem, as described in Dahl et al. (2005, 2008).
Finally, we calculate , where , the sum of the number of diagonal elements and the number of edges in the estimated edge matrix, and .
We select the combination of hyperparameters, , , , that results in the smallest BIC.
4 Posterior Consistency
Posterior consistency is a fundamental way of validating a Bayesian method using a frequentist yardstick in the large sample setting, and is of interest to both frequentists and Bayesians; for a thorough account of posterior consistency, see Ghosal and van der Vaart (2017). In Gaussian graphical models, using point mass spike-and-slab priors, Banerjee and Ghosal (2015) showed that the posterior for is consistent in the high-dimensional setting provided that , where stands for the number of non-zero off-diagonal entries of the true . With a slight modification of the arguments, it follows that the result extends to continuous spike-and-slab priors provided that the spike scale is sufficiently small with increasing . In the nonparanormal model, the main complicating factor comes from the unknown transformations , since the rest will then be as in a Gaussian graphical model. Below we argue that these transformations may be estimated consistently in an appropriate sense.
We study the posterior distributions for each transformation separately, which can be learned from the marginal likelihood for each component. Thus the problem of posterior consistency for can be generically described as follows. For brevity, we drop the index . Consider the model , where is a continuously differentiable, strictly monotone increasing transformation from to . Clearly, this model is not identifiable and hence consistent estimation is not possible in the usual sense. Identifiability can be ensured by setting and , but the procedure followed in this paper instead puts constraints on : and . We shall show that the posterior for is consistent under this set of constraints.
As the function is necessarily unbounded near and to ensure that is normally distributed, which is a distribution with unbounded support, it is clear that uniform posterior consistency for is not possible. We shall, therefore, consider the notion of uniform convergence on a compact subset of : for a fixed , the pseudo-metric to consider is . Even then, the usual posterior distribution may be highly impacted by observations near or , so we actually study a modified posterior distribution, based on observations falling within the given fixed compact subset of , with , to be described below.
Let be the true transformation function, which is assumed to be continuously differentiable and strictly monotone increasing and complying with the constraints and . Let and be respectively the true values of and . Note that the cumulative distribution function (c.d.f.) of is given by and the corresponding true c.d.f. is , where stands for the c.d.f. of the standard normal distribution.
Consider i.i.d. observations from the true distribution. Let be the number of observations falling in , the number of observations falling below and the number of observations falling above . Let be the observations falling in . The posterior consistency is based on the posterior given these complying observations , and the counts .
Observe that and . Then and . Let stand for the c.d.f. of , let stand for its true value and let be the true value of . Then we have the identity
for all .
Thus we have
We note that the posterior distributions of the quantities and can be obtained based on the counts and respectively. In particular, using a Dirichlet prior on the probability vector , we have consistency for the posterior distribution of at . We shall assume that the posterior distribution of is consistent. Note that the truncated observations alone do not lead to a posterior distribution for .
The modification in the posterior distribution of , and that we consider can be described as follows. Using the given prior on and the truncated observations , we obtain the induced posterior distribution of , while we obtain the posterior distribution on directly conditioning on . Then the posterior distribution of is induced from (4.1). Finally, the modified posterior distribution of is induced from the relations
and (4.2) in view of the restrictions and . The corresponding true values satisfy the analogous relations
The following theorem on posterior consistency refers to this modified posterior distribution rather than the original posterior distribution of . The proof can be found in the Supplementary Material.
In the above setting let the prior on and contain and in its support and independently, the prior for satisfies the condition that
Then for any ,
The condition on the prior for the transformation is satisfied by the truncated normal prior described in Section 2, and hence the transformation (as well as the mean and variance parameters and ) are consistently estimated by the posterior, as shown in the following corollary.
Let the prior on be described by , where the prior for has infinite support and is given a truncated normal prior as described in Section 2. Then for any , and hence (4.6) holds.
We conduct a simulation study to assess the performance of the Bayesian approach to graphical structure learning in nonparanormal graphical models. This method will be referred to as ‘Spike Slab’ in the results. The unobserved random variables, , are simulated from a multivariate normal distribution such that for . The means are selected from an equally spaced grid between 1 and 2 with length . We consider nine different combinations of and sparsity for :
, , sparsity = non-zero entries in the off-diagonals
, , sparsity = non-zero entries in the off-diagonals
, , sparsity = non-zero entries in the off-diagonals
, , AR(1) model
, , AR(1) model
, , AR(1) model
, , circle model
, , circle model
, , circle model
where the circle model and the AR(1) model are described by the relations