
Bayesian Regularization for Functional Graphical Models

10/11/2021
by Jiajing Niu, et al.

Graphical models, used to express conditional dependence between random variables observed at various nodes, are used extensively in many fields such as genetics, neuroscience, and social network analysis. While most current statistical methods for estimating graphical models focus on scalar data, there is interest in estimating analogous dependence structures when the data observed at each node are functional, such as signals or images. In this paper, we propose a fully Bayesian regularization scheme for estimating functional graphical models. We first consider a direct Bayesian analog of the functional graphical lasso proposed by Qiao et al. (2019). We then propose a regularization strategy via the graphical horseshoe. We compare these approaches via simulation study and apply our proposed functional graphical horseshoe to two motivating applications, electroencephalography data for comparing brain activation between an alcoholic group and controls, as well as changes in structural connectivity in the presence of traumatic brain injury (TBI). Our results yield insight into how the brain attempts to compensate for disconnected networks after injury.


1 Introduction

Graphical models use graphs to model and draw inferences concerning conditional independence among a collection of random variables or processes, each of which is associated with a particular location (also called a node or a vertex). They have been used to study flow cytometry between cell proteins (Friedman et al., 2008), to estimate networks from gene expression data (Li et al., 2019), and to identify communicating regions from electroencephalography (EEG) data (Qiao et al., 2019). In this work we focus on Gaussian graphical models, where the data follow a multivariate Gaussian distribution. In this case, estimating the edge set is equivalent to identifying the nonzero elements of the precision matrix associated with the Gaussian distribution.

Broadly speaking, frequentist studies of graphical models have involved either neighborhood selection (Meinshausen et al., 2006) or the graphical lasso (Yuan and Lin, 2007; Friedman et al., 2008). The neighborhood selection method regresses each variable on the remaining variables with regularization and then combines the estimated neighborhoods. However, this method can be computationally intensive and thus does not scale to larger graphs. On the other hand, Friedman et al. (2008) proposed the graphical lasso (glasso) via a Gaussian log-likelihood with lasso regularization on the entire precision matrix. The glasso has proven useful and is widely used, owing to its established sparsity and convergence rates (Lam and Fan, 2009) as well as the associated computational techniques (Friedman et al., 2008; Zhu et al., 2014). A Bayesian version of the graphical lasso was proposed by Wang (2012), who illustrated potential differences between the posterior mean and the posterior mode that might be encountered. Li et al. (2019) extended the ideas of Wang (2012) by proposing a graphical horseshoe estimator, along with an efficient Markov chain Monte Carlo (MCMC; Gelfand and Smith, 1990) algorithm for its implementation.

To date, most of the graphical modeling literature has focused on data in which each node has an associated scalar or vector-valued response variable. However, many real-world applications involve the collection of functional data at each node. In this case, we have a collection of subjects/units for whom a set of continuously-supported random functions are (discretely) observed, one function at each node, where the support may be time- or spatially-indexed, or both. For example, in neuroscience there is much interest in studying connectivity; e.g., in terms of connected regions of interest measured with functional magnetic resonance imaging (fMRI; Shappell et al., 2019) or communicating electrodes in electroencephalography (EEG; Zhang et al., 1995) corresponding to associated regions of neuronal activity. Alternatively, in social network analysis and marketing, it is possible to observe and record online behavior patterns among baskets of different goods for each customer over a period of time to identify related types of products. Compared to scalar or vector-valued graphical models, functional graphical models remain vastly underexplored.

Qiao et al. (2019) proposed a functional version of the graphical lasso along with a block-coordinate descent algorithm for optimizing the loss function. Around the same time, Zhu et al. (2016) proposed a Bayesian framework for working with functional graphical models directly in the space of infinite-dimensional random functions, essentially extending the work of Dawid and Lauritzen (1993) to function space by using hyper-inverse Wishart priors to a priori model the space of plausible, decomposable graphs. Recently, Zhang et al. (2021) proposed a Bayesian model for functional graphical models in which independent Laplace priors are placed on reparameterized partial correlations associated with basis coefficients, inducing a so-called normal hypo-exponential shrinkage prior and allowing the graph to evolve functionally over time. To the best of our knowledge, Zhu et al. (2016) and Zhang et al. (2021) are the only existing works on Bayesian functional graphical models.

Our work is motivated by neuroimaging data that typically have low signal-to-noise ratios caused by non-neural noise arising from cardiac and respiratory processes or scanner instability, a problem that is exacerbated by the typically small numbers of subjects available from such studies. For instance, in Section 5.2 we study the effects of traumatic brain injury on connectivity of the human brain. The diffusion-weighted magnetic resonance imaging data consist of longitudinal measurements of white matter integrity within 26 regions of interest in 34 subjects, 17 of whom have been diagnosed with a traumatic brain injury (TBI). We aim to assess chronic structural connectivity differences between the TBI and non-TBI groups using the sparse, irregularly-measured longitudinal data, a goal for which few techniques currently exist. Further, quantifying model uncertainty is important. For example, Greenlaw et al. (2017) used an imaging genetics example to demonstrate dramatic differences in associations between genetic variations and brain imaging measures that might be identified when accounting for uncertainty in a model estimate versus using an optimization-based point estimate alone.

In this paper, we propose two different regularization schemes for functional graphical models. The first approach we consider is a direct Bayesian version of the frequentist functional graphical lasso proposed by Qiao et al. (2019). We also propose a functional graphical horseshoe, motivated by the horseshoe's known improvements upon the lasso's tendency to over-shrink large coefficients and under-shrink small coefficients in high-dimensional problems (Wang, 2012; Li et al., 2019). Whereas most existing Bayesian approaches to covariance or precision matrix estimation assume structure such as banded covariance (e.g., Banerjee and Ghosal, 2014) or decomposable graphs (e.g., Rajaratnam et al., 2008; Xiang et al., 2015; Zhu et al., 2016), neither the Bayesian functional graphical lasso nor the functional graphical horseshoe assumes any structure other than sparsity. We provide efficient Gibbs sampling algorithms for both of our proposed models, exploiting auxiliary variables to produce a set of easily-sampled conditional distributions. Through extensive simulation studies, we evaluate both the classification accuracy and the fidelity of the estimated coefficients. We apply our proposed Bayesian functional graphical horseshoe to two motivating datasets: the EEG alcoholic versus control study presented by Qiao et al. (2019), and a novel study of white matter connectivity between healthy patients and those with a history of traumatic brain injury using data obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
This article is organized as follows. In Section 2, we provide a brief background on Gaussian graphical models, including basic concepts and notation for structure learning through precision matrix estimation as well as the original graphical lasso (Yuan and Lin, 2007). The section also reviews functional principal component analysis (FPCA) and its connection to both the frequentist and Bayesian regularization approaches to functional graphical models. In Section 3, we present both of our proposed approaches, the Bayesian functional graphical lasso and the functional graphical horseshoe model, as well as efficient Gibbs samplers. We present and discuss the statistical and computational performance of our proposed methods in Section 4, followed by applying the functional graphical horseshoe to the motivating applications in Section 5. A summary and concluding remarks may be found in Section 6. The code for implementing our proposed algorithms is available at https://github.com/jjniu/BayesFGM.

2 Background

2.1 Gaussian Graphical Models and the Graphical Lasso

Suppose the random vector $X = (X_1,\dots,X_p)^\top$ follows a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. Then we define $\Omega = \Sigma^{-1}$ as the precision matrix, or concentration matrix. A Gaussian graphical model (GGM) is based on an undirected graph $G = (V, E)$, where $V = \{1,\dots,p\}$ is a non-empty set of vertices and $E$ is a set of edges representing unordered pairs of vertices (also called nodes). Each variable $X_j$ is represented by a node in the graph, and $E$ determines the sparsity pattern of the precision matrix in that, for $i \neq j$, $(i,j) \in E$ if and only if $\omega_{ij} \neq 0$. By Theorem 2.2 in Rue and Held (2005), $\Omega$ thus encodes a Markov property of the distribution: for any pair of nodes $i \neq j$, $X_i$ and $X_j$ are conditionally independent given the remaining variables, written $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$, if and only if $\omega_{ij} = 0$, where $\perp$ denotes statistical independence. This is the pairwise Markov property. By this property, learning the graph associated with a Gaussian graphical model is equivalent to estimating the precision matrix of the multivariate Gaussian distribution, making it a covariance estimation problem (Dempster, 1972).
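To make the link between zeros in $\Omega$ and conditional independence concrete, the following sketch computes the partial correlations $-\omega_{ij}/\sqrt{\omega_{ii}\omega_{jj}}$ implied by a precision matrix and reads off the corresponding edge set (the matrix values are illustrative, not taken from any dataset in the paper).

```python
import numpy as np

# Illustrative 4-node precision matrix: zero off-diagonal entries encode
# conditional independence between the corresponding nodes.
Omega = np.array([[2.0, 0.6, 0.0, 0.0],
                  [0.6, 2.0, 0.5, 0.0],
                  [0.0, 0.5, 2.0, 0.4],
                  [0.0, 0.0, 0.4, 2.0]])

d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)   # rho_ij given the rest = -omega_ij / sqrt(omega_ii * omega_jj)
np.fill_diagonal(partial_corr, 1.0)

# Edge set of the GGM: node pairs with nonzero precision entries
p = Omega.shape[0]
edges = [(i, j) for i in range(p) for j in range(i + 1, p) if Omega[i, j] != 0]
print(edges)   # [(0, 1), (1, 2), (2, 3)]
```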

Given a sample $x_1,\dots,x_n$, stored in a data matrix $X \in \mathbb{R}^{n \times p}$, the goal is to estimate and select the non-zero elements of $\Omega$, thereby obtaining an estimate of the undirected graph associated with the GGM. The log-likelihood of $\Omega$ (up to an additive constant) can be written as

$\ell(\Omega) = \frac{n}{2}\left\{\log\det\Omega - \operatorname{tr}(S\Omega)\right\}, \qquad (1)$

where $S = n^{-1}\sum_{i=1}^{n} x_i x_i^\top$ is the sample covariance matrix. The quantity $-\ell(\Omega)$ is a convex function of $\Omega$, and the maximum likelihood estimator of $\Omega$ is $S^{-1}$. This estimator enjoys nice properties such as consistency, but can be unstable when $p$ is large relative to $n$ (and does not exist when $p > n$). Further, even when $S^{-1}$ exists, it can be an unsatisfactory estimator of $\Omega$ due to the fact that it will generally not be sparse, even if $\Omega$ is sparse.

To find a more stable estimator of $\Omega$ that is simultaneously sparse, Yuan and Lin (2007) proposed to solve a lasso-type regularized version of the likelihood objective function by finding

$\hat\Omega_{\lambda} = \operatorname*{arg\,min}_{\Omega \in \mathcal{S}_{++}} \left\{\operatorname{tr}(S\Omega) - \log\det\Omega + \lambda\|\Omega\|_{1}\right\}, \qquad (2)$

where $\mathcal{S}_{++}$ is the space of symmetric positive definite matrices, the norm $\|\Omega\|_{1}$ is the sum of the absolute values of the off-diagonal elements, and $\lambda \geq 0$ is a non-negative tuning parameter. Like the original lasso (Tibshirani, 1996; Knight and Fu, 2000), Yuan and Lin (2007) showed that, for a given $\lambda$ value, the solution to this minimization problem produces a consistent estimator of the true precision matrix. Yuan and Lin (2007) solved this problem with the so-called maxdet algorithm (Vandenberghe et al., 1998). However, by connecting it to earlier work on the lasso, Friedman, Hastie, and Tibshirani (2008) proposed a more efficient coordinate descent algorithm for solving (2). This is the graphical lasso, an approach that has since become very popular for structure learning in GGMs.
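For reference, the graphical lasso in (2) is implemented in standard software; the short sketch below uses scikit-learn's GraphicalLasso on simulated data (the data, penalty value, and threshold are illustrative choices, not settings used in the paper).

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))      # n = 200 observations on p = 10 nodes (illustrative data)

gl = GraphicalLasso(alpha=0.2).fit(X)   # alpha plays the role of the tuning parameter in (2)
Omega_hat = gl.precision_

# Estimated edge set: off-diagonal entries not shrunk to (numerically) zero
edges = [(i, j) for i in range(10) for j in range(i + 1, 10)
         if abs(Omega_hat[i, j]) > 1e-8]
```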

Wang (2012) considered the fully Bayesian version of the graphical lasso by recognizing that solving (2) is equivalent to finding the maximum a posteriori (MAP) estimator in the following model,

$x_i \mid \Omega \overset{\text{iid}}{\sim} \mathrm{N}_p(0, \Omega^{-1}), \quad i = 1, \dots, n, \qquad (3)$
$p(\Omega \mid \lambda) \;\propto\; \prod_{i < j} \mathrm{DE}(\omega_{ij} \mid \lambda)\, \prod_{j=1}^{p} \mathrm{Exp}(\omega_{jj} \mid \lambda/2)\, \mathbb{1}(\Omega \in \mathcal{S}_{++}), \qquad (4)$

where $\mathrm{N}_p$ denotes the density of a $p$-dimensional Gaussian distribution, and likewise for the double exponential ($\mathrm{DE}$) and exponential ($\mathrm{Exp}$) distributions. Using a hierarchical representation of this model (Kyung et al., 2010) and matrix partitioning techniques similar to those employed by Banerjee et al. (2008) and Friedman et al. (2008), Wang (2012) developed an efficient Gibbs sampler for exploring the full posterior distribution and thus was able to extensively compare the results of the MAP and posterior mean estimators.

2.2 Functional Principal Component Analysis

For subject $i = 1,\dots,n$, let the underlying, infinite-dimensional function of interest be denoted $g_i(t)$, $t \in \mathcal{T}$. We assume that $g_1,\dots,g_n$ are independent and identically distributed zero-mean functions in $L^2(\mathcal{T})$ with covariance function $C(s,t) = \operatorname{Cov}\{g_i(s), g_i(t)\}$, where $\mathcal{T}$ is a compact interval on the real line. Karhunen (1946) and Loève (1963) independently discovered the functional principal component analysis (FPCA) expansion (e.g., Bosq, 2012),

$g_i(t) = \sum_{k=1}^{\infty} a_{ik}\,\phi_k(t), \qquad (5)$

where $\{\phi_k\}_{k \ge 1}$ are the orthonormal set of eigenfunctions of $C$ with corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ satisfying $C(s,t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s)\phi_k(t)$, by Mercer's Theorem, and $a_{ik} = \int_{\mathcal{T}} g_i(t)\phi_k(t)\,dt$ are the functional principal component (FPC) scores of $g_i$, uncorrelated across $k$ with $\operatorname{E}(a_{ik}) = 0$ and $\operatorname{Var}(a_{ik}) = \lambda_k$. By assumption, the scores are independent across subjects $i$. Like ordinary principal component analysis (e.g., Jolliffe, 2002), the expansion can be truncated to obtain a finite-dimensional approximation to the infinite-dimensional process. In what follows, the proposed functional graphical models can work with any basis expansion (e.g., wavelets or Fourier), but we use FPCA due to the mean square optimality of the truncated approximation.

Performing FPCA in practice amounts to finding the spectral decomposition of an approximation to the covariance function. When the functions are observed on the same evenly spaced grid, independent of $i$, this amounts to a standard singular value decomposition (SVD) of the sample covariance matrix. For irregularly spaced functions and/or different numbers of observations on each function, the SVD will likely provide a poor approximation to the true eigensystem associated with $C$. In this case, Yao et al. (2005) proposed the PACE algorithm for performing FPCA via conditional expectation. In our applications, we use SVD for the EEG example and PACE for the diffusion MRI example, as the latter involves irregularly sampled longitudinal data.
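For curves observed on a common, evenly spaced grid, FPCA reduces to an eigendecomposition of the sample covariance matrix evaluated on the grid; a minimal sketch is below (the grid is assumed to span [0, 1], and the function name and quadrature weight are our own illustrative choices; irregular designs would instead call for PACE, as noted above).

```python
import numpy as np

def fpca_regular_grid(G, M):
    """FPCA for centered curves observed on a common, evenly spaced grid.

    G : (n, T) array, one curve per row, observed on a grid of T points in [0, 1].
    M : truncation level (number of retained components).
    Returns eigenfunctions (T, M), eigenvalues (M,), and FPC scores (n, M).
    """
    n, T = G.shape
    dt = 1.0 / T                              # quadrature weight for the evenly spaced grid
    C_hat = (G.T @ G) / n                     # sample covariance evaluated on the grid
    evals, evecs = np.linalg.eigh(C_hat)      # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    phi = evecs[:, :M] / np.sqrt(dt)          # rescale so eigenfunctions are orthonormal in L2
    lam = evals[:M] * dt
    scores = G @ phi * dt                     # a_ik = integral of g_i(t) * phi_k(t) dt
    return phi, lam, scores
```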

2.3 Functional Graphical Models

For a particular subject $i$, suppose we (discretely) observe functions $g_{i1},\dots,g_{ip}$, where $g_{ij}$ is the function observed on node $j$. Suppose further that each function is a Gaussian process, so that $g_i = (g_{i1},\dots,g_{ip})^\top$ is a realization from a $p$-dimensional multivariate Gaussian process (MGP). As in typical GGMs, we associate to the MGP an undirected graph $G = (V, E)$ that represents the conditional dependence network. Here, conditional dependence of the functions $g_{ij}$ and $g_{il}$ is in terms of the cross-covariance function,

$C_{jl}(s,t) = \operatorname{Cov}\left\{g_{ij}(s),\, g_{il}(t) \,\big|\, g_{ik}(\cdot),\, k \neq j, l\right\}, \qquad (6)$

assumed to be the same for all subjects $i$.

With the covariance function in hand, we can use FPCA and approximate each $g_{ij}$ with the $M$-dimensional truncation,

$g_{ij}^{M}(t) = \sum_{k=1}^{M} a_{ijk}\,\phi_{jk}(t). \qquad (7)$

The function for subject $i$ at node $j$ can thus be represented with the coefficient vector $a_{ij} = (a_{ij1},\dots,a_{ijM})^\top$, so that each subject's entire functional information over all nodes is encoded in $a_i = (a_{i1}^\top,\dots,a_{ip}^\top)^\top \in \mathbb{R}^{pM}$. Under the Gaussian assumption and independently observed subjects, the Karhunen–Loève Theorem tells us that $a_i \sim \mathrm{N}_{pM}(0, \Sigma)$, independently across $i$. For learning the graphical model, Qiao et al. (2019, Lemma 1) show that, in the finite-rank case (in which the $M$-truncated approximation is exact),

$E = \left\{(j,l) : j \neq l,\ \|\Omega_{jl}\|_F \neq 0\right\}, \qquad (8)$

where $\Omega_{jl}$ is the $M \times M$ block submatrix of $\Omega = \Sigma^{-1}$ corresponding to the node pair $(j,l)$ and $\|\cdot\|_F$ is the Frobenius norm. Thus, structure learning in the functional graphical model is equivalent to finding the pairs $(j,l)$ for which $\|\Omega_{jl}\|_F \neq 0$.

The connection in (8) to the graphical lasso and the group lasso (Yuan and Lin, 2006) led Qiao et al. (2019) to propose estimating the graph from functional data with

$\hat\Omega = \operatorname*{arg\,max}_{\Omega \succ 0} \left\{ \log\det\Omega - \operatorname{tr}(S\Omega) - \lambda \sum_{j \neq l} \|\Omega_{jl}\|_F \right\}, \qquad (9)$

where $S$ is the sample covariance matrix computed from estimated FPC scores $\hat a_1,\dots,\hat a_n$, found via SVD or otherwise, and $\lambda \ge 0$ is a tuning parameter. As with the group lasso, blockwise sparsity is achieved as $\lambda$ increases. Qiao et al. term this approach the functional graphical lasso (fglasso). The edge set of the estimated graph is then $\hat E = \{(j,l) : j \neq l,\ \|\hat\Omega_{jl}\|_F \neq 0\}$. Qiao et al. (2019) show that the fglasso enjoys model selection consistency, and provide a block coordinate descent algorithm for optimizing the objective function.
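Whether the block precision matrix over the stacked FPC scores comes from the fglasso or from the Bayesian estimators developed below, reading off the graph amounts to computing blockwise Frobenius norms; a small sketch (with our own function name and tolerance) follows.

```python
import numpy as np

def edges_from_block_precision(Omega, p, M, tol=1e-10):
    """Return node pairs (j, l) whose M x M block of Omega has nonzero Frobenius norm.

    Omega : (p*M, p*M) precision matrix of the stacked FPC score vectors.
    """
    edges = []
    for j in range(p):
        for l in range(j + 1, p):
            block = Omega[j * M:(j + 1) * M, l * M:(l + 1) * M]
            if np.linalg.norm(block, ord='fro') > tol:
                edges.append((j, l))
    return edges
```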

3 Bayesian Functional Graphical Models

3.1 The Bayesian Fglasso

It is well known that frequentist optimization of objective functions may often be viewed as maximum a posteriori (MAP) estimation under a Bayesian model, provided there exists a prior density corresponding to the penalty term in the objective function. For the fglasso objective function in (9), the Bayesian counterpart uses a prior on the precision matrix given by

$p(\Omega \mid \lambda) \;\propto\; \prod_{1 \le j < l \le p} \exp\left\{-\lambda \|\Omega_{jl}\|_F\right\}\, \mathbb{1}\!\left(\Omega \in \mathcal{S}_{++}\right), \qquad (10)$

where $\|\cdot\|_F$ denotes the Frobenius norm and $\Omega_{jl}$ is the $(j,l)$th $M \times M$ submatrix of $\Omega$ associated with the conditional cross-correlation between node $j$ and node $l$, $j \neq l$. Since the precision matrix is symmetric, we need only consider the upper off-diagonal elements for computational simplicity. As used in the Bayesian group lasso hierarchical representation (Kyung et al., 2010), we have the following identity,

$\exp\left\{-\lambda \|z\|_2\right\} \;\propto\; \int_0^\infty \left(\frac{1}{2\pi\tau}\right)^{m/2} \exp\left\{-\frac{\|z\|_2^2}{2\tau}\right\}\, \tau^{\frac{m+1}{2}-1} \exp\left\{-\frac{\lambda^2 \tau}{2}\right\}\, d\tau, \qquad z \in \mathbb{R}^m. \qquad (11)$

Thus, we can rewrite the prior on each off-diagonal block as a scale mixture of a multivariate normal distribution on the off-diagonal elements. Introducing auxiliary latent parameters $\tau_{jl} > 0$, $1 \le j < l \le p$, the prior in (10) can be attained as a gamma mixture of normals, leading to the functional graphical lasso hierarchy

(12)

Denote with $\hat a_i \in \mathbb{R}^{pM}$ the estimated $M$-truncated functional principal component scores for the observed functions on sample $i$, $i = 1,\dots,n$, where for simplicity we assume the truncation level $M$ is the same across all nodes. When the underlying functions are drawn from an MGP, the score vector follows a $pM$-dimensional Gaussian distribution. Then the Bayesian fglasso model can be expressed as

(13)

where are the diagonal elements of and is a normalizing constant.

The hierarchical representation in (12) facilitates the use of conditional conjugacy in deriving a block Gibbs sampler for exploring the posterior distribution. For a fixed regularization parameter $\lambda$, the posterior distribution associated with the Bayesian fglasso model (13) is given by

(14)

where $S = \sum_{i=1}^{n} \hat a_i \hat a_i^\top$ is the sample scatter matrix of the functional PC scores. This representation allows us to adapt the block Gibbs sampling scheme proposed by Wang (2012).

By assumption, the off-diagonal entries of each block submatrix on the main diagonal, $\Omega_{jj}$, are all zeros. Partition the precision matrix and the sample FPC score covariance matrix as follows:

(15)

where we define

(16)

Here the matrix is permuted so that the last column/row corresponds to a particular node and score, with the remaining rows and columns collecting the other FPC scores. The zero pattern in the last column follows from each score being uncorrelated with the other scores at the same node.

Define the $pM \times pM$ matrix $T$ whose $(j,l)$th block is $\tau_{jl}\mathbf{1}_{M \times M}$ for $j \neq l$, where $\mathbf{1}_{M \times M}$ is the $M \times M$ matrix with all ones. We similarly partition it as

(17)

where

(18)

with defined analogously to .

The conditional distribution of the nonzero variables in the last column (or row) of $\Omega$ is

(19)

where and is the cross covariance matrix associated with the remaining nodes. We make a change of variables, , and denote . This implies

(20)

All elements in the matrix $\Omega$ can be sampled by sampling one row and column at a time, permuting after each iteration. Due to the block structure of $\Omega$, we first cycle through all columns corresponding to the same node, then move to the next node.
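Concretely, updating one row/column at a time requires symmetrically permuting the working matrices so that the entry being updated sits in the last row and column; a small indexing sketch (illustrative, not the authors' implementation) is below.

```python
import numpy as np

def move_entry_last(A, idx):
    """Symmetrically permute A so that row/column `idx` is last.

    Returns the permuted matrix and the permutation (so updates can be mapped back).
    """
    n = A.shape[0]
    perm = np.concatenate([np.delete(np.arange(n), idx), [idx]])
    return A[np.ix_(perm, perm)], perm

# With p nodes and M scores per node, score k of node j has flat index j * M + k,
# so the sampler cycles idx over the M scores within node j before moving to the next node.
```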

After updating all of the off-diagonal elements, the diagonal elements of $\Omega$ and the latent shrinkage parameters need to be sampled. The full conditional distributions of the latent shrinkage parameters are seen to be independently inverse Gaussian, with mean and shape determined by $\lambda$ and the current values of the corresponding off-diagonal blocks. Put another way, the reparameterized model based on one particular permutation of $\Omega$ under the Bayesian functional graphical lasso is

(21)

Since the newly sampled diagonal increment is positive with probability one, the positive definite constraint on $\Omega$ is maintained in each iteration. The argument for the functional case is adapted from that given by Wang (2012). Suppose at the current iteration the sample is positive definite, so all its corresponding leading principal minors are positive. After updating the particular column and row of $\Omega$ by sampling as in (21), the new sample has the same leading principal minors as before except the one corresponding to the updated column/row, which is of order $pM$. It is easy to see that this last leading principal minor is the product of the newly sampled (positive) increment and the leading principal minor excluding the updated column and row, so all leading principal minors of the updated matrix are positive. To ensure each MCMC realization of $\Omega$ is positive definite, it is only required that the chain is initialized with a positive definite matrix. Algorithm 1 details the Bayesian fglasso Gibbs sampler.
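The inverse Gaussian draws for the latent scales can be done directly with NumPy's Wald (inverse Gaussian) generator; the sketch below is generic, and the particular mean and shape expressions follow the usual Bayesian group-lasso form rather than being quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_latent_scale(block_fro_sq, lam):
    """Draw a latent block scale tau_jl from its (assumed) full conditional.

    Assumed group-lasso-style form: 1/tau_jl | rest ~ InverseGaussian(mean, shape)
    with mean = sqrt(lam^2 / ||Omega_jl||_F^2) and shape = lam^2.
    """
    mean = np.sqrt(lam ** 2 / block_fro_sq)
    shape = lam ** 2
    return 1.0 / rng.wald(mean, shape)   # NumPy's Wald distribution is the inverse Gaussian

# Example: sample_latent_scale(block_fro_sq=0.25, lam=2.0)
```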

Input: Sum of the products matrix $S$, i.e., $S = \sum_{i=1}^{n} \hat a_i \hat a_i^\top$.
Output: MCMC sample of the precision matrix .
Initialization: Set $p$ to be the number of nodes in the graph, and set initial values for the precision matrix and the latent scale matrix (e.g., the identity matrix for the precision matrix and a matrix with all elements equal to one for the latent scales);
while Convergence criteria are not met do
       for  do
             Partition , and into blocks (focus on updating th column block of corresponding node );
             for  do
                   1. Partition , and as in (15) and (17);
                   2. Draw ;
                    3. Draw , where ;
                   4. Update ;
                  
             end for
            Update by sampling for
       end for
      Store the realization of precision matrix ;
       Increment .
end while
Algorithm 1 Bayesian functional graphical lasso Gibbs sampler

Given the MCMC output of a sample of precision matrices, several inferential procedures are possible for constructing an estimate of $\Omega$ and thus of the graph. Continuous shrinkage priors do not put positive probability mass on exact zeros in the precision matrix, and Carvalho et al. (2010) argue that using (nonzero) posterior means as the basis for inference is often preferable to binary thresholding due to the estimator's optimality under squared error loss. Nevertheless, it is sometimes necessary to produce a sparse estimate with exact zeros, especially in the case of graphical models. Carvalho et al. (2010) and Wang (2012) discuss some possible thresholding rules. In our case, we construct the precision matrix (and thus graph) estimate by setting an entry to zero if its marginal posterior credible interval contains zero. Note that different levels of sparsity may be achieved by using credible intervals at different levels.
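Given the stored MCMC draws, the credible-interval rule just described can be applied entry-wise; a minimal sketch follows (the 95% level is an illustrative default, and block-wise variants based on Frobenius norms are equally possible).

```python
import numpy as np

def threshold_by_credible_interval(omega_draws, level=0.95):
    """Zero out precision entries whose marginal credible interval contains zero.

    omega_draws : (n_draws, pM, pM) array of MCMC samples of the precision matrix.
    Returns the posterior mean with thresholded entries set to zero.
    """
    alpha = 1.0 - level
    lower = np.quantile(omega_draws, alpha / 2, axis=0)
    upper = np.quantile(omega_draws, 1 - alpha / 2, axis=0)
    estimate = omega_draws.mean(axis=0)
    estimate[(lower < 0) & (upper > 0)] = 0.0   # interval straddles zero -> set entry to zero
    return estimate
```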

The Bayesian fglasso proposed here assumes that the regularization parameter $\lambda$ is fixed, meaning that it must be tuned and selected a priori. Cross-validation is computationally expensive, especially for Bayesian models implemented via MCMC. Further, Wasserman and Roeder (2009) show that cross-validation based on the log-likelihood loss function tends to lead to overfitting and unnecessarily dense graphs. Other than cross-validation, approaches such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), and stability selection (Meinshausen et al., 2006) have been well studied in the graphical model literature. In the functional case, though, AIC and BIC do not work well, since it is unclear how to calculate the effective degrees of freedom. Thus, selecting an appropriate hyperparameter $\lambda$ ahead of time is a nontrivial task. On the other hand, in the Bayesian framework, we can (for instance) assign a gamma prior to $\lambda$. In this case, the full conditional for $\lambda$ is

(22)

This can in turn be incorporated into the Gibbs sampler given in Algorithm 1 as an additional sampling step.

3.2 The Functional Graphical Horseshoe

In the presence of sparsity, as is often the case for precision matrices associated with GGMs, it is desirable to have a shrinkage approach that yields exact zeros or values close to zero for the true null cases while shrinking the truly non-zero cases as little as possible, so that the resulting estimates have little bias. To address this desire, Carvalho et al. (2010) proposed the horseshoe prior. The prior has high probability concentration near zero and is heavy-tailed, properties that contribute to the desired shrinkage behavior. Further, the prior can be expressed as a scale mixture of Gaussian distributions and thus is easily incorporated into a Gibbs sampler for posterior exploration. Carvalho et al. (2010) originally proposed the horseshoe for the sparse normal means model, but it was recently extended by Li et al. (2019) to estimation of GGMs. Li et al. (2019) established that the resulting horseshoe estimates are close to the unbiased least-squares estimates with high probability and, further, that the Bayesian graphical lasso tends to be further away from the least-squares estimates than the graphical horseshoe. In this section, we propose an extension of graphical horseshoe regularization to the case of functional graphical models.
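The scale-mixture representation mentioned above also gives a direct way to draw from the horseshoe prior itself, which is useful for visualizing its spike-at-zero and heavy-tail behavior; a small sketch (with an illustrative global scale) follows.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_horseshoe_prior(size, tau=1.0):
    """Draw from the horseshoe prior via its scale mixture representation:
    lambda_j ~ C+(0, 1), theta_j | lambda_j ~ N(0, lambda_j^2 * tau^2)."""
    lam = np.abs(rng.standard_cauchy(size))   # half-Cauchy local scales
    return rng.normal(0.0, lam * tau)

draws = sample_horseshoe_prior(100_000)
# Heavy tails plus strong concentration near zero, compared with a standard normal:
print(np.quantile(np.abs(draws), [0.5, 0.99]))
```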

We define the functional graphical horseshoe by using horseshoe priors on each off-diagonal block of the precision matrix and exponential priors on the diagonal elements. This yields the following prior:

(23)

where $\lambda_{jl} \sim \mathrm{C}^+(0,1)$, $\tau \sim \mathrm{C}^+(0,1)$, and $\mathrm{C}^+(0,1)$ represents the half-Cauchy distribution with density $p(x) \propto (1 + x^2)^{-1}$ for $x > 0$. As in other versions of the horseshoe prior, the global shrinkage parameter $\tau$ is determined by the sparsity of the entire precision matrix, whereas the local shrinkage parameters $\lambda_{jl}$ preserve blocks with signal by allowing them to be pulled toward zero considerably less than the zero blocks. Unlike Li et al. (2019), but similar to Wang (2012), we specify an exponential prior for the diagonal elements of $\Omega$. This is convenient for deriving the full conditional distributions and does not affect inference, since the graph is determined by the off-diagonal elements.

The full conditional distribution of $\Omega$ under the assumption of a multivariate Gaussian likelihood is given by

(24)

where $S = \sum_{i=1}^{n} \hat a_i \hat a_i^\top$. The standard technique for creating a straightforward Gibbs sampler with the horseshoe is to use the observation of Makalic and Schmidt (2016) that if $x^2 \mid \nu \sim \mathrm{InvGamma}(1/2, 1/\nu)$ and $\nu \sim \mathrm{InvGamma}(1/2, 1)$, then, marginally, $x \sim \mathrm{C}^+(0,1)$. Thus, we introduce latent variables $\nu_{jl}$ and $\xi$ to facilitate conditional conjugacy when updating the shrinkage parameters $\lambda_{jl}$ and $\tau$.
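Under the Makalic and Schmidt (2016) augmentation, all shrinkage parameters have inverse gamma full conditionals. The sketch below shows generic updates assuming a $\mathrm{N}(0, \lambda_{jl}^2\tau^2 I)$ prior on each vectorized off-diagonal block, which is one natural reading of (23); the exact shapes and rates are derived under that assumption and are not quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def inv_gamma(shape, rate):
    """Draw from InverseGamma(shape, rate) via NumPy's gamma generator."""
    return 1.0 / rng.gamma(shape, 1.0 / rate)

def update_horseshoe_scales(block_fro_sq, nu, tau_sq, xi, M):
    """One Makalic-Schmidt pass, assuming vec(Omega_jl) ~ N(0, lam_jl^2 * tau^2 * I_{M^2}).

    block_fro_sq : (K,) squared Frobenius norms of the K off-diagonal blocks.
    nu           : (K,) current auxiliary variables for the local scales.
    """
    K = block_fro_sq.shape[0]
    lam_sq = inv_gamma((M ** 2 + 1) / 2, 1.0 / nu + block_fro_sq / (2.0 * tau_sq))
    nu = inv_gamma(1.0, 1.0 + 1.0 / lam_sq)
    tau_sq = inv_gamma((K * M ** 2 + 1) / 2, 1.0 / xi + np.sum(block_fro_sq / (2.0 * lam_sq)))
    xi = inv_gamma(1.0, 1.0 + 1.0 / tau_sq)
    return lam_sq, nu, tau_sq, xi
```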

Under the parameter-expanded hierarchical model, the full conditional distribution of the precision matrix is given by

(25)

We can use a data-augmented Gibbs sampler with the same matrix permutation as used for the Bayesian fglasso proposed in Subsection 3.1.

In each iteration, the rows and columns of the relevant $pM \times pM$ matrices are partitioned the same way as in Subsection 3.1 to derive the full conditional distributions; i.e.,

(26)

where the blocks are arranged as before. The derivation of the full conditionals for the last column is similar to that for the Bayesian fglasso, via a change of variables. The conditional distribution of the nonzero variables of the last column in $\Omega$ is

(27)

where . Making a change of variables by , and letting , then the full conditional of is

(28)

As for the Bayesian fglasso, we first cycle through all columns corresponding to the same node, then move to the next node. After the entire matrix is updated, the local and global shrinkage parameters need to be sampled. Through conditional conjugacy, their full conditional distributions are quickly seen to be inverse gamma. Positive definiteness is maintained during each iteration as long as the starting value is positive definite, for the same reason that the positive definite constraint is satisfied in the Bayesian fglasso sampler. The full Gibbs sampler is summarized in Algorithm 2.

Input: Sum of the products matrix $S$, i.e., $S = \sum_{i=1}^{n} \hat a_i \hat a_i^\top$.
Output: Samples of precision matrix .
Initialization: Set $p$ to be the number of nodes in the graph, and set initial values for the precision matrix, the local and global shrinkage parameters, and the auxiliary latent variables (e.g., the identity matrix for the precision matrix and all-ones matrices for the rest);
while Convergence criteria are not met do
       for  do
             Partition , , and into blocks (focus on updating th column block of corresponding node and all the other nodes);
             1. for  do
                   (1) Partition , , and as (26);
                   (2) Draw ;
                   (3) Draw , where ;
                   (4) Update
             end for
             2. Update the local shrinkage parameters, i.e., draw samples from their inverse gamma full conditionals;
              3. Update the corresponding auxiliary variables, i.e., draw samples from their inverse gamma full conditionals;
              4. Update the global shrinkage parameter and its auxiliary variable, i.e., draw from their inverse gamma full conditionals;
            
       end for
      Store the sample precision matrix ;
       Increment .
end while
Algorithm 2 Bayesian functional graphical horseshoe Gibbs sampler

4 Numerical Experiments

We designed simulation studies to assess the performance of posterior inference using the Bayesian functional graphical lasso and horseshoe outlined in Section 3. For our simulation studies, we considered different sample sizes (), graph sizes (), two different types of networks, and both sparse- and dense-sampled functions. This allows us to compare the frequentist fglasso of Qiao et al. (2019), Bayesian fglasso, and functional graphical horseshoe to each other across a variety of scenarios. We assess classification accuracy and fidelity of the estimates of both the zero and non-zero entries of the precision matrices in Subsection 4.1. As a matter of practicality for researchers interested in implementing the proposed techniques, we also compare the computational expense associated with each procedure across different computing environments (R, NumPy, and TensorFlow) in Subsection 4.2.

4.1 Estimation and Classification

Similar to the simulation studies considered by Qiao et al. (2019), we simulate functional data in which each observed curve is generated from the first five Fourier basis functions with a zero-mean Gaussian coefficient vector. Hence, the stacked coefficient vector follows a multivariate Gaussian distribution whose covariance matrix is the inverse of a specified precision matrix, and the underlying graph is determined by the sparsity pattern of that precision matrix. We consider here two types of networks (a small data-generation sketch under assumed band values is given after the network descriptions below):

  • Network 1: A block banded matrix with nonzero blocks on the main diagonal and on the first- and second-order off-diagonal bands, and 0 elsewhere. The network results in each node being connected to its immediate neighbors, with weaker connections to its second-order neighbors.

  • Network 2: For alternating blocks of 10 nodes, the corresponding submatrices of the precision matrix are the same as those in Network 1, so that every other block of 10 nodes is connected as in Network 1. The remaining entries are set to 0, so the remaining nodes are fully isolated.
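For concreteness, the following sketch generates data along the lines of Network 1 and the Fourier construction described above; the band weights (0.5 and 0.4), grid, and dimensions are illustrative placeholders rather than the constants used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def fourier_basis(t, K=5):
    """First K Fourier basis functions evaluated on a grid t in [0, 1]."""
    cols = [np.ones_like(t)]
    for k in range(1, K // 2 + 2):
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)[:, :K]

def block_banded_precision(p, M, w1=0.5, w2=0.4):
    """Block-banded precision: identity diagonal blocks, w1*I on first-order and
    w2*I on second-order neighbor blocks (assumed weights; zero elsewhere)."""
    Theta = np.eye(p * M)
    for j in range(p):
        for l in range(j + 1, min(j + 3, p)):
            w = w1 if l == j + 1 else w2
            Theta[j * M:(j + 1) * M, l * M:(l + 1) * M] = w * np.eye(M)
            Theta[l * M:(l + 1) * M, j * M:(j + 1) * M] = w * np.eye(M)
    return Theta

p, M, n, T = 10, 5, 100, 50
Theta = block_banded_precision(p, M)                              # Network-1-style precision
delta = rng.multivariate_normal(np.zeros(p * M), np.linalg.inv(Theta), size=n)
t = np.linspace(0.0, 1.0, T)
S = fourier_basis(t, M)                                           # (T, M) basis evaluations
curves = np.einsum('tm,ijm->ijt', S, delta.reshape(n, p, M))      # noiseless curves, (n, p, T)
```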

For each network, we generate realizations of . The observed data are then generated as