# Consistent Bayesian Sparsity Selection for High-dimensional Gaussian DAG Models with Multiplicative and Beta-mixture Priors

Estimation of the covariance matrix for high-dimensional multivariate datasets is a challenging and important problem in modern statistics. In this paper, we focus on high-dimensional Gaussian DAG models where sparsity is induced on the Cholesky factor L of the inverse covariance matrix. In recent work, ([Cao, Khare, and Ghosh, 2019]), we established high-dimensional sparsity selection consistency for a hierarchical Bayesian DAG model, where an Erdos-Renyi prior is placed on the sparsity pattern in the Cholesky factor L, and a DAG-Wishart prior is placed on the resulting non-zero Cholesky entries. In this paper we significantly improve and extend this work, by (a) considering more diverse and effective priors on the sparsity pattern in L, namely the beta-mixture prior and the multiplicative prior, and (b) establishing sparsity selection consistency under significantly relaxed conditions on p, and the sparsity pattern of the true model. We demonstrate the validity of our theoretical results via numerical simulations, and also use further simulations to demonstrate that our sparsity selection approach is competitive with existing state-of-the-art methods including both frequentist and Bayesian approaches in various settings.

## Authors

• 11 publications
• 11 publications
• 10 publications
• ### Joint Bayesian Variable and DAG Selection Consistency for High-dimensional Regression Models with Network-structured Covariates

We consider the joint sparse estimation of regression coefficients and t...
04/20/2020 ∙ by Xuan Cao, et al. ∙ 0

• ### Bayesian Linear Regression for Multivariate Responses Under Group Sparsity

We study the frequentist properties of a Bayesian high-dimensional multi...
07/10/2018 ∙ by Bo Ning, et al. ∙ 0

• ### Effective Learning of a GMRF Mixture Model

Learning a Gaussian Mixture Model (GMM) is hard when the number of param...
05/18/2020 ∙ by Shahaf E. Finder, et al. ∙ 16

• ### Bayesian Graph Selection Consistency For Decomposable Graphs

Gaussian graphical models are a popular tool to learn the dependence str...
01/14/2019 ∙ by Yabo Niu, et al. ∙ 0

• ### Learning the Structure for Structured Sparsity

Structured sparsity has recently emerged in statistics, machine learning...
03/10/2015 ∙ by Nino Shervashidze, et al. ∙ 0

• ### The Beta-Mixture Shrinkage Prior for Sparse Covariances with Posterior Minimax Rates

Statistical inference for sparse covariance matrices is crucial to revea...
01/12/2021 ∙ by Kyoungjae Lee, et al. ∙ 0

• ### Informative Gaussian Scale Mixture Priors for Bayesian Neural Networks

Encoding domain knowledge into the prior over the high-dimensional weigh...
02/24/2020 ∙ by Tianyu Cui, et al. ∙ 11

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Covariance estimation and selection is a fundamental problem in multivariate statistical inference, and plays a crucial role in many data analytic methods. In high-dimensional settings, where the number of variables is much larger than the number of samples, the sample covariance matrix (traditional estimator for the population covariance matrix) can perform rather poorly. See (Bickel and Levina, 2008a, b; El Karoui, 2007) for example. To address the challenge posed by high-dimensionality, several promising methods have been proposed in the literature. In particular, methods inducing sparsity in the covariance matrix , its inverse , or the Cholesky factor of the inverse, have proven to be very effective in applications. In this paper, we focus on imposing sparsity on the Cholesky factor of the inverse covariance (precision) matrix. These models are also referred to as Gaussian DAG models.

Consider a case when we have i.i.d. observations obeying a

-variate normal distribution with mean vector

and covariance matrix . Let be the unique modified Cholesky decomposition of the inverse covariance matrix , where is a lower triangular matrix with unit diagonal entries, and is a diagonal matrix with positive diagonal entries. A given sparsity pattern on corresponds to certain conditional independence relationships, which can be encoded in terms of a directed acyclic graph on the set of variables as follows: if the variables and do not share an edge in , then (see Section 2 for more details).

There are two major approaches in the literature for sparse estimation of . The first approach is based on regularized likelihood/pseudolikelihood using penalization. See (Huang, Liu, Pourahmadi, and Liu, 2006; Rutimann and Buhlmann, 2009; Shojaie and Michailidis, 2010; Rothman, Levina, and Zhu, 2010; Aragam, Amini, and Zhou, 2015; Yu and Bien, 2016; Khare, Oh, Rahman, and Rajaratnam, 2017). Some of these frequentist approaches assume is banded, i.e., the elements of that are far from the diagonal are taken to be zero. The other methods put restrictions on the maximum number of non-zero entries in .

On the Bayesian side, when the underlying graph is known, literature exists that explores the posterior convergence rates for Gaussian concentration graph models, which induce sparsity in the inverse covariance matrix . See (Banerjee and Ghosal, 2014, 2015; Xiang, Khare, and Ghosh, 2015; Lee and Lee, 2017) for example. Gaussian concentration graph models and Gaussian DAG models studied in this paper intersect only at perfect DAG models, which are equivalent to decomposable concentration graphical models. For general Gaussian DAG models, comparatively fewer works have tackled with asymptotic consistency properties. Recently, Cao, Khare, and Ghosh (2019) establish both strong model selection consistency and posterior convergence rates for sparse Gaussian DAG models in a high-dimensional regime. In particular, the authors consider a hierarchical Gaussian DAG model with DAG-Wishart priors introduced in (Ben-David, Li, Massam, and Rajaratnam, 2016) on the Cholesky parameter space and independent Bernoulli

priors for each edge in the DAG (the so-called Erdos-Renyi prior). However, the sparsity assumptions on the true model required to establish consistency are rather restrictive. In addition, as a result of the extremely small value of the edge probability

in the Bernoulli prior, the simulations studies always tend to favor more sparse models under smaller values of . Lee, Lee, and Lin (2018) also explore the Cholesky factor selection consistency under the empirical sparse Cholesky (ESC) prior and -posteriors. Compared with (Cao, Khare, and Ghosh, 2019), under relaxed conditions in terms of the dimensionality, sparsity and lower bound of the non-zero elements in the Cholesky factor, Lee, Lee, and Lin (2018) establish strong model selection consistency with the -posterior distribution.

It recently came to our attention that two more flexible alternative priors compared to the Erdos-Renyi prior have been considered in the undirected graphical models literature: (a) the multiplicative prior (Tan, Jasra, De Iorio, and Ebbels, 2017), and (b) the beta-mixture prior (Carvalho and Scott, 2009). Both priors are more diverse than the Erdos-Renyi prior (the Erdos-Renyi prior can be obtained as a degenerate version of these priors), and have various attractive properties. For example, the multiplicative model prior can account for greater variability in the degree distribution as compared to the Erods-Renyi model, while the beta-mixture prior allows for stronger control over the number of spurious edges and corrects for multiple hypothesis testing automatically. We provide the algebraic forms of these priors in Section 3 and Section 5 respectively, and refer the reader to (Carvalho and Scott, 2009; Tan, Jasra, De Iorio, and Ebbels, 2017) for a detailed discussion of their properties.

To the best of our knowledge, a rigorous investigation of high-dimensional posterior consistency properties with the multiplicative prior or the beta-mixture prior has not been undertaken for either undirected graphical models or Gaussian DAG models. Hence, our goal was to investigate if high-dimensional consistency results could be established under these two more diverse and algebraically complex class of prior distributions in the Gaussian DAG model setting. Another goal was to investigate if these high-dimensional posterior consistency results can be obtained under much weaker conditions as compared to (Cao, Khare, and Ghosh, 2019), particularly conditions similar to those in (Lee, Lee, and Lin, 2018). This was a challenging goal, particularly for the multiplicative model prior, as the prior mass function is not available in closed form (note that the mass functions for the Erdos-Renyi, ESC and beta-mixture priors are available in closed form).

As the main contributions of this paper, we establish high-dimensional posterior consistency results for Gaussian DAG models with spike and slab priors on the Cholesky factor , under both the multiplicative prior as well as the beta-mixture prior on the sparsity pattern in (Theorems 4.1 to 5.3), using assumptions similar to those in (Lee, Lee, and Lin, 2018) (where a different setting of ESC priors and -posteriors is used). Also, through simulation studies, we demonstrate that the models studied in this paper can outperform existing state-of-the-art methods including both penalized likelihood and Bayesian approaches in different settings.

The rest of paper is organized as follows. Section 2 provides background material regarding Gaussian DAG model and introduce the spike and slab prior on the Choleksy factor. In Section 3, we revisit the multiplicative prior, and present our hierarchical Bayesian model and the parameter class for the inverse covariance matrices. Model selection consistency results for both the multiplicative prior and the beta-mixture prior are stated in Section 4 and Section 5 with proofs provided in Section 7. In Section 6 we use simulation experiments to illustrate the posterior ratio consistency result, and demonstrate the benefits of our Bayesian approach and computation procedures for Choleksy factor selection vis-a-vis existing Bayesian and penalized likelihood approaches. We end our paper with a discussion session in Section 8.

## 2 Preliminaries

In this section, we provide the necessary background material from graph theory, Gaussian DAG models, and also introduce our spike and slab prior on the Cholesky parameter.

### 2.1 Gaussian DAG Models

We consider the multivariate Gaussian distribution

 Y∼Np(0,Ω−1), (1)

where is a inverse covariance matrix. Any positive definite matrix can be uniquely decomposed as , where is a lower triangular matrix with unit diagonal entries, and is a diagonal matrix with positive diagonal entries. This decomposition is known as the modified Cholesky decomposition of (see for example Pourahmadi (2007)). In particular, the model (1) can be interpreted as a Gaussian DAG model depending on the sparsity pattern of .

A directed acyclic graph (DAG) consists of the vertex set and an edge set such that there is no directed path starting and ending at the same vertex. As in (Ben-David, Li, Massam, and Rajaratnam, 2016; Cao, Khare, and Ghosh, 2019), we will without loss of generality assume a parent ordering, where that all the edges are directed from larger vertices to smaller vertices. For several applications in genetics, finance, and climate sciences, a location or time based ordering of variables is naturally available. For example, in genetic datasets, the variables can be genes or SNPs located contiguously on a chromosome, and their spatial location provides a natural ordering. More examples can be found in (Huang, Liu, Pourahmadi, and Liu, 2006; Shojaie and Michailidis, 2010; Yu and Bien, 2016; Khare, Oh, Rahman, and Rajaratnam, 2017). The set of parents of , denoted by , is the collection of all vertices which are larger than and share an edge with . Similarly, the set of children of , denoted by , is the collection of all vertices which are smaller than and share an edge with .

A Gaussian DAG model over a given DAG , denoted by , consists of all multivariate Gaussian distributions which obey the directed Markov property with respect to a DAG . In particular, if and , then

 Yi⊥Y{i+1,…,p}∖pai(D)|Ypai(D),

for each . Furthermore, it is well-known that if is the modified Cholesky decomposition of , then if and only if whenever . In other words, the structure of the DAG is uniquely reflected in the sparsity pattern of the Cholesky factor . In light of this, it is often more convenient to reparametrize the inverse covariance matrix in terms of the Cholesky parameter .

### 2.2 Notations

Consider the modified cholesky decomposition , where is a lower triangular matrix with all the unit diagonals and , where ’s are all positive. We suggest to impose spike and slab priors on the lower diagonal of

to recover the sparse structure of the Cholesky factor. To facilitate this purpose, we introduce latent binary variables

for to indicate whether is active, i.e., if and 0, otherwise. We can view the binary variable as the indicator for the sparsity pattern of . In other words, for each , let , a subset of , be the index set of all non-zero components in . explicitly gives the support of the Cholesky factor and the sparsity pattern of the underlying DAG. Denote as the cardinality of set for .

Following the definition of , for any matrix , denote the column vectors and Also, let ,

 A≥iZ=[Aii(A>Z.i)TA>Z.iA>iZ].

In particular, .

Next, we provide some additional required notation. For , let and represent the standard and norms. For a matrix , let

be the ordered eigenvalues of

and denote

 ∥A∥max=max1≤i,j≤p|Aij|,
 ∥A∥(r,s)=sup{∥Ax∥s:∥x∥=1}, for 1≤r,s<∞.

In particular,

 ∥A∥(1,1)=maxj∑i|Aij|, ∥A∥(∞,∞)=maxi∑j|Aij| and ∥A∥(2,2)=eigp(A)12.

### 2.3 Spike and Slab Prior on Cholesky Parameter

In this section, we specify our spike and slab prior on the Cholesky factor as follows.

 Lkj∣dj,Zkjind∼ZkjN(0,τ2dj)+(1−Zkj)δ0(Lkj),1≤j

for some constants , where denotes a point mass at . We refer to (2) and (3) as our spike and slab Cholesky (SSC) prior. implies being the “signal” (i.e., from the slab component), and implies

being the noise (i.e., from the spike component). Note that to obtain our desired asymptotic consistency results, appropriate conditions for these hyperparameters will be introduced in Section

4.1. Xu and Ghosh (2015) also impose this type of priors on the regression factors. Further comparisons and discussion are provided in Remark 3.

###### Remark 1.

Note that in (3), we are allowing the hyperparameters for the inverse-gamma prior to be zero. In (Cao, Khare, and Ghosh, 2019), the DAG-Wishart prior with multiple shape parameters introduced in (Ben-David, Li, Massam, and Rajaratnam, 2016) is placed on the Cholesky parameter. As indicated in Theorem 7.3 in (Ben-David, Li, Massam, and Rajaratnam, 2016)

, the DAG-Wishart distribution defined on the Cholesky parameter space given a DAG yields the independent inverse-gamma distribution with strictly positive shape and scale parameters on

and multivariate Gaussian distribution on the non-zero elements in each column of given . Hence, for given DAG structures, there are some difference and connection between the DAG-Wishart prior and our spike and slab prior.

## 3 Model Specification

In this section, we revisit the multiplicative prior introduced in (Tan, Jasra, De Iorio, and Ebbels, 2017) over space of graphs, and specify our hierarchical model.

### 3.1 Multiplicative Prior

In the context of Gaussian graphical model, Tan, Jasra, De Iorio, and Ebbels (2017) allow the probability of a link between nodes and , to vary with by taking and for each . The authors further treat each as a variable with a beta prior to adopt a fully Bayesian approach. The authors further utilize Laplace approximations, and through simulation studies, show that the proposed multiplicative model (following the nomenclature in (Tan, Jasra, De Iorio, and Ebbels, 2017)) facilitates the purpose to encourage sparsity or graphs that exhibit particular degree patterns based on prior knowledge. Adapted to our framework, we consider the following multiplicative prior over the space of sparsity variation for the Cholesky factor.

 π(Z∣ω1,…,ωp)=∏1≤j

where are positive constants. Compared with the universal indicator probability in an Erdos-Renyi prior, here we allow the variation attainable in the degree structure of each node through different values of . Note that under the multiplicative prior, the marginal posterior for can not be obtained in closed form, which leads to further challenges not only in the theoretical analysis, but also in the computational strategy. We will elaborate on this matter in Section 6.

### 3.2 Hierarchical Model Formulation

Let be independent and identically distributed -variate Gaussian vectors with mean and true covariance matrix , where is the modified Cholesky decomposition of . Let denotes the sample covariance matrix. The sparsity pattern of the true Choleksy factor is uniquely encoded in the true binary variable denoted as . Similar to (Cao, Khare, and Ghosh, 2019), we also denote as the maximum number of non-zero entries in any column of , and . For sequences and , means for some constant , as . Let represent as .

The class of spike and slab Cholesky distributions in Section 2 and the multiplicative priors in Section 3.1 can be used for Bayesian model selection of the Cholesky factor through the following hierarchical model,

 (6) Lkj∣dj,Zkjind∼ZkjN(0,τ2dj)+(1−Zkj)δ0(Lkj),1≤j

where

represents the beta distribution with shape parameters

. The proposed hierarchical model now has five hyperparameters: the scale parameter in model (7

) controlling the variance of the spike part in the spike and slab prior on each

, the shape parameter and scale parameter in model (8), and the two positive shape parameters in the beta distribution in model (10). Further restrictions on these hyperparameters to ensure desired consistency will be specified in Section 4.1.2.

The intuition behind this set-up with latent variables is that the elements in the Cholesky factor with zero or very small values will be identified with zero

values, while the active entries will be classified as

. We use the posterior probabilities of all the

latent variables to identify the active elements in . In particular, the following lemmas help specify the upper bound for the marginal probability ratio and the marginal posterior ratio for any “non-true” model compared with the true model under the multiplicative prior. The proof will be provided in Section 7.1.

###### Lemma 3.1.

If the hyperparameter in model (10) satisfies , for , we have

 π(Z)π(Z0)≤e2α21+2α1+2α1p∏j=1B(α1+|Zj|,α2)B(α1+|Z0j|,α2), (11)

for .

Lemma 3.1 further enables the marginalized posterior likelihood ratio to be upper bounded by decomposed prior terms absorbed into the product of items as follows.

###### Lemma 3.2.

If , for , the marginal posterior ratio between any “non-true” model and the true model under the multiplicative prior in (4) and (5) satisfies

 π(Z|Y)π(Z0|Y) ≤ M1p−1∏j=1(nτ2)−|Zj|−|Z0j|2B(α1+|Zj|,α2)B(α1+|Z0j|,α2) ×|~S≥jZ0|12|~S≥jZ|12⎛⎝~Sj|Z0j~Sj|Zj⎞⎠12⎛⎜ ⎜⎝~Sj|Z0j−1nτ2n,p+2λ2n~Sj|Zj−1nτ2n,p+2λ2n⎞⎟ ⎟⎠n2+λ1 ≜ M1×p−1∏j=1PR′j(Z,Z0), (12)

where , and .

## 4 Model Selection Consistency

In this section we will explore the high-dimensional asymptotic properties of the Bayesian model selection approach for the Cholesky factor specified in Section 3.2. For this purpose, we will work in a setting where the dimension of the data vectors, and the hyperparameters vary with the sample size and . Assume that the data is actually being generated from a true model specified as follows. Let be independent and identically distributed -variate Gaussian vectors with mean and true covariance matrix , where is the modified Cholesky decomposition of . The sparsity pattern of the true Choleksy factor is reflected in . Recall the definition in Section 3.2 that is the maximum number of non-zero entries in any column of , and . In order to establish our asymptotic consistency results, we need the following mild assumptions with respective discussion/interpretation.

### 4.1 Assumptions

#### 4.1.1 Assumptions on the True Parameter Class

###### Assumption 1.

There exists , such that for every .

This assumption ensures that the eigenvalues of the true precision matrices are bounded by fixed constants, which has been commonly used for establish high dimensional covariance asymptotic properties. See for example (Bickel and Levina, 2008a; El Karoui, 2008; Banerjee and Ghosal, 2014; Xiang, Khare, and Ghosh, 2015; Banerjee and Ghosal, 2015). Previous work (Cao, Khare, and Ghosh, 2019) relaxes this assumption by allowing the lower and upper bounds on the eigenvalues to depend on and .

###### Assumption 2.

as .

This is a much weaker assumption for high dimensional covariance asymptotic than for example, (Xiang, Khare, and Ghosh, 2015; Banerjee and Ghosal, 2014, 2015; Cao, Khare, and Ghosh, 2019). Here we essentially allow the number of variables to grow slower than compared to previous literatures with rate .

###### Assumption 3.

as .

Recall that is the smallest (in absolute value) non-zero off-diagonal entry in . Hence, this assumption also known as the “beta-min” condition also provides a lower bound for the “slab” part of

that is needed for establishing consistency. This type of condition has been used for the exact support recovery of the high-dimensional linear regression models as well as Gaussian DAG models. See for example

(Lee, Lee, and Lin, 2018; Yang, Wainwright, and Jordan, 2016; Khare, Oh, Rahman, and Rajaratnam, 2017; Cao, Khare, and Ghosh, 2019; Yu and Bien, 2016).

###### Remark 2.

It is worthwhile to point out that our assumptions on the true Cholesky factor are weaker compared to (Lee, Lee, and Lin, 2018). In particular, Lee, Lee, and Lin (2018) introduce conditions A(2) and A(4) on the sparsity pattern of the true Cholesky factor such that the number of non-zero elements in each row as well as each column of to be smaller than some constant , while in this paper, we are allowing the maximum number of non-zero entries in any column of to grow at a smaller rate than (Assumption 2).

#### 4.1.2 Assumptions on the Prior Hyperparameters

###### Assumption 4.

for all satisfying , where .

This assumption essentially states that the prior on the space of the possible models, places zero mass on unrealistically large models (see similar assumptions in (Johnson and Rossell, 2012; Shin, Bhattacharya, and Johnson, 2018; Narisetty and He, 2014) in the context of regression). Assumption 4 is also more relaxed compared with Condition (P) in (Lee, Lee, and Lin, 2018) where for some constant . Note that this condition is for the hyperparameter of the prior distribution on the latent variables only, which does not affect the true parameter space.

###### Assumption 5.

The hyperparameter in (9) satisfies and , as , for some constant .

This assumption provides the rate at which the variance of the slab prior is required to grow to guarantee desired model selection consistency. Similar conditions on the hyperparameter can be seen in (Narisetty and He, 2014; Shin, Bhattacharya, and Johnson, 2018; Johnson and Rossell, 2012).

###### Assumption 6.

There exists a constant , such that the hyperparameters in model (8) satisfy and the shape parameters in model (10) satisfies , , for , for some .

This assumption provides the rate at which the shape parameter needs to grow to ensure desired consistency. Previous literature with Erdos-Renyi priors puts restrictions on the rate of the edge probability. In particular, previous work (Cao, Khare, and Ghosh, 2019) assumes , where for some to penalize large models. Similar assumptions on the hyperparameters can be also found in (Yang, Wainwright, and Jordan, 2016; Narisetty and He, 2014) under regression setting. In Section 6.2, we will see the proposed model without specifying particular values for helps avoiding the potential computation limitation such as simulation results always favor the most sparse model.

For the rest of this paper, , , ,, , will be denoted as , , , , , , by leaving out the superscript for notational convenience.

### 4.2 Posterior Ratio Consistency

We now state and prove the main model selection consistency results. The proofs for all the theorems will be provided in Section 7.1 and Section 7.2. Our first result establishes what we refer to as “posterior ratio consistency” (following the terminology in (Cao, Khare, and Ghosh, 2019)). This notion of consistency implies that the true model will be the mode of the posterior distribution among all the models with probability tending to as

###### Theorem 4.1.

Under Assumptions 1-6, the following holds:

 maxZ≠Z0π(Z|Y)π(Z0|Y)¯P→0, as n→∞.

### 4.3 Model Selection Consistency for Posterior Mode

If one was interested in a point estimate of which reflects the sparsity pattern of , the most apparent choice would be the posterior mode defined as

 ^Z=argmaxZπ(Z|Y). (13)

From a frequentist point of view, it would be natural to obtain if we have model selection consistency for the posterior mode, which follows immediately from posterior ratio consistency established in Theorem 4.1, by noting that Therefore, we have the following corollary.

###### Corollary 4.1.

Under Assumptions 1-6, the posterior mode is equal to the true model with probability tending to , i.e.,

 ¯P(^Z=Z0)→1, as n→∞.
###### Remark 3.

In the context of linear regression, Xu and Ghosh (2015) tackle the Bayesian group lasso problem. In particular, the authors propose the following hierarchical Bayesian model:

 Y∣X,β,σ2∼N(Xβ,σ2I) βgind∼σ2,τ2g∼(1−π0)N(0,σ2τ2gI)+π0δ0(βg),g=1,2,…,G, τ2gind∼Gamma (mg+12,λ22),g=1,2,…,G, σ2ind∼Inverse-Gamma (α1,α2).

In particular, they impose an independent spike and slab type prior on each factor (conditional on the variance parameter ), and an inverse Gamma prior on the variance. Each regression factor is explicitly present in the model with a probability . In this setting under an orthogonal design, the authors in (Xu and Ghosh, 2015) establish oracle property and variable selection consistency for the median thresholding estimator of the regression coefficients on the group level. Note that with parent ordering, the off-diagonal entries in the column of can be interpreted as the linear regression coefficients corresponding to fitting the variable against all variables with label greater than . Hence, there are similarities with respect to the model and consistency results between (Xu and Ghosh, 2015) and this work. However, despite these similarities, fundamental differences exist in these models and the corresponding analysis. Firstly, the number of groups (or factors) is considered to be fixed in (Xu and Ghosh, 2015), while we allow the number of predictors to grow at an exponential rate of in a ultra high-dimensional setting, which creates more theoretically challenges. Secondly, the ‘design’ matrices corresponding to the regression coefficients in each column of which can be represented as functions of the sample covariance matrix are random and correlated with each other, while (Xu and Ghosh, 2015) only considers the orthogonal design where with no correlation introduced. Thirdly, the consistency result in (Xu and Ghosh, 2015) focuses only on group level selection only and is tailored for problems that only require group level sparsity, while our model can induce sparsity in each individual element of . The authors also propose a Bayesian hierarchical model referred to as Bayesian sparse group lasso to enable shrinkage both at the group level and within a group. However, no consistency results are addressed regarding this model. Lastly, in our model, each coefficient is present independently with multiplicative prior that incorporates information that is sparse, which is not the case in (Xu and Ghosh, 2015) as each factor is present with . In particular, all the aspects discussed above lead to major differences and further challenges in analyzing the ratio of posterior probabilities.

### 4.4 Strong Model Selection Consistency

Next we establish another stronger result (compared to Theorem 4.1) which implies that the posterior mass assigned to the true model converges to 1 in probability (under the true model). Following (Narisetty and He, 2014; Cao, Khare, and Ghosh, 2019), we refer to this notion of consistency as strong selection consistency.

###### Theorem 4.2.

Under Assumptions 1-6, the following holds:

 π(Z0|Y)¯P→1, as n→∞.
###### Remark 4.

We would like to point out that our posterior ratio consistency and strong model selection consistency do not require any additional assumptions on bounding the maximum number of edges. In particular, Cao, Khare, and Ghosh (2019) consider only the DAGs with the total number of edges at most for . By the assumptions in the previous work, it follows that the DAGs in the analysis do not include the models where the Cholesky factor has one or more non-zero elements for each column, since , as , while in our result, each row can have at most number of non-zero entries as indicated in Assumption 4. Hence, our strong model selection consistency results is more general than (Cao, Khare, and Ghosh, 2019; Lee, Lee, and Lin, 2018) in the sense that the consistency holds for a larger class of DAGs.

## 5 Results for Beta-mixture Prior

Though the multiplicative prior could allow variation among the indicator probabilities, the intractable marginal posteriors remain problematic in practice. The authors in (Tan, Jasra, De Iorio, and Ebbels, 2017) address this issue via Laplace approximation. However, the computational cost for that will become extensive as increases. To obtain the marginal posterior probabilities in closed form and for ease of computation, we consider the following beta-mixture prior over the space of introduced in (Carvalho and Scott, 2009),

 Zkj∣qi.i.d∼Bern(q),1≤j

where

denotes the Bernoulli distribution with probability

, and represents the beta distribution with shape parameters . We refer to model (14) and (15) as the beta-mixture prior over the space of latent variables indicating the sparsity structure for the Cholesky factor.

###### Remark 5.

Cao, Khare, and Ghosh (2019); Banerjee and Ghosal (2015) introduce an Erdos-Renyi type of distribution on the space of DAGs as the prior distribution for DAGs, where each directed edge is present with probability independently of the other edges. In particular, they define , to be the edge indicator and let , be independent identically distributed Bernoulli(

Cao, Khare, and Ghosh (2019) establish the DAG selection consistency under suitable assumptions. while Banerjee and Ghosal (2015) address the estimation consistency, and provide high-dimensional Laplace approximations for the marginal posterior probabilities for the graphs. In our framework, we extend the previous work by putting a beta distribution on the edge probability . The beta-mixture type of priors have previously been placed on graphs for simulation purpose in (Carvalho and Scott, 2009), but the theoretical properties have yet to be investigated. A clear advantage of such an approach as indicated in (Carvalho and Scott, 2009) is that treating the previous fixed tuning constant as a model parameter shrinks the graph size to a data-determined value of , and allows strong control over the number of spurious edges.

In order to obtain the posterior consistency for , we need the following lemma, which specifies the closed form for the marginal posterior density of with proof provided in Section 7.3.

###### Lemma 5.1.

The marginal posterior density under the beta-mixture prior satisfies

 π(Z|Y) ∝ B(α1(p−1)+p−1∑j=1|Zj|,α2(p−1)+p(p−1)2−p−1∑j=1|Zj|) ×p−1∏j=1⎛⎝n~Sj|Zj2−12τ2+λ2⎞⎠−n2−λ1|~S>jZ|−12(nτ2)|Zj|/2 = B(α1(p−1)+p−1∑j=1|Zj|,α2(p−1)+p(p−1)2−p∑j=1|Zj|) ×p−1∏j=1⎛⎝n~Sj|Zj2−12τ2+λ2⎞⎠−n2−λ1(|~S≥iZ|~Sj|Zj)−12(nτ2)|Zj|/2, (16)

in which , and The second equation follows from

In particular, these posterior probabilities can be used to select a model representing the sparsity pattern of by computing the posterior mode that maximize the posterior densities. The convenient closed form for the marginal posterior in (5.1) also yields nice posterior ratio consistency under the following weaker assumption on compared with Assumption 6.

###### Assumption 7.

There exists a constant , such that the hyperparameters in model (8) satisfy and the shape parameters in model (10) satisfies , .

###### Theorem 5.2.

Under Assumptions 1-5 and 7, the following holds under the beta-mixture prior:

 maxZ≠Z0π(Z|Y)π(Z0|Y)¯P→0, as n→∞.

The next theorem establishes the strong selection consistency under the beta-mixture prior. See proofs for Theorem 5.2 and Theorem 5.3 in Section 7.3.

###### Theorem 5.3.

Under Assumptions 1-6, for the beta-mixture prior, the following holds:

 π(Z0|Y)¯P→1, as n→∞.
###### Remark 6.

We would like to point out that posterior ratio consistency (Theorem 5.2 does not require any restriction on (the rate of the shape parameter in the beta distribution (15)) that will be growing, this requirement is only needed for strong selection consistency (Theorem 5.3). Similar restrictions on the hyperparameters have been considered for establishing consistency properties in the regression setup. See (Yang, Wainwright, and Jordan, 2016; Lee, Lee, and Lin, 2018; Cao, Khare, and Ghosh, 2018) for example.

The closed form for the marginal posterior probability in (5.1) is convenient for showing the consistency. However, when it comes to simulation, the beta term in (5.1) pertaining to the beta-mixture prior is often too large, and could sometimes blow up when

is relatively large. In addition, for the beta-mixture prior, probability

is assumed to be universal across all indicators, which seems not flexible and diverse enough. In the following section, we will take on the task to investigate and evaluate the simulation performance for both the multiplicative model and the beta-mixture model.

## 6 Simulation Studies

In this section, we demonstrate our main results through simulation studies. First recall from (5.1) that the marginal posterior distributions for under the beta-mixture prior can be derived analytically in closed form (up to a constant) in (5.1). Therefore, we can evaluate the parameter space more clearly with this naturally assigned “score”, that is the posterior probability.

For the multiplicative prior, the () can not be integrated out, thus the closed form for the marginal distribution of can not be conveniently acquired. As indicated in (Tan, Jasra, De Iorio, and Ebbels, 2017), evaluating the marginal densities via Monte Carlo becomes more computationally intensive as the dimension increases. Therefore, the authors propose to estimate these quantities efficiently through Laplace approximation instead. Detailed functional and Hessian expressions can be found in the supplemental material in (Tan, Jasra, De Iorio, and Ebbels, 2017). Here we adopt the same Laplace approximation for estimating the marginal densities for . However, as we will see in Figure 3, though the multiplicative prior could potentially lead to better model selection performance, the additional procedure when evaluating each individual posterior probability could be quite time consuming. In particular, the Newton-type algorithm used for obtaining the mode of the log-likelihood runs extremely slow in higher dimensions.

### 6.1 Simulation I: Illustration of Posterior Ratio Consistency

In this section, we illustrate the consistency result in Theorem 4.1 and Theorem 5.2 using a simulation experiment. Our goal is to show that the log of the posterior ratio for any “non-true” model compared to the true model will converge to negative infinity. To serve this purpose, we consider different values of ranging from to , and choose . Next, for each fixed , a lower triangular matrix with diagonal entries and off-diagonal entries is constructed. In particular, unlike in previous work (Cao et al., 2019) where the expected value of non-zero entries in each column of does not exceed 3, here we randomly chose 3% or 5% of the lower triangular entries of the Cholesky factor and set them to be 0.5. The remaining entries were set to zero.

The purpose of this setting is to show our consistency requires more relaxed sparsity assumptions on the true model compared to (Cao, Khare, and Ghosh, 2019). We refer to this matrix as . The matrix also reflects the true underlying DAG structure encoded in . Next, we generate i.i.d. observations from the distribution, and set the hyperparameters as , , , for . The above process ensures all the assumptions are satisfied. We then examine posterior ratio consistency under four different cases by computing the log posterior ratio of a “non-true”model and as follows.

1. Case : Model is a submodel of and the number of total non-zero entries of is exactly half of , i.e. .

2. Case : is a submodel of and the number of total non-zero entries of is exactly twice of , i.e. .

3. Case : is not necessarily a submodel of , but satisfying the number of total non-zero entries in is half the number of non-zero entries in .

4. Case : is not necessarily a submodel of , but the number of total non-zero entries in is twice the number of non-zero elements in .

The log of the posterior probability ratio for various cases under two different sparsity settings and our two different priors are provided in Figure 1. As expected the log of the posterior probability ratio decreases to large negative numbers as becomes large in all four cases and in both sparsity settings and under both sparsity priors, thereby providing a numerical illustration of Theorem 4.1.

We would like to point out that in (Cao, Khare, and Ghosh, 2019), the log of posterior ratios are almost all positive real numbers, when and the expected value of non-zero entries in each column of does not exceed 3, which indicates the hierarchical model with DAG-Wishart distribution and the Erdos-Renyi type of prior over graphs only performs better with really higher dimension and much more sparse settings. In particular, this leads to one potential drawback of using the DAG-Wishart distribution coupled with the Erdos-Renyi type of prior on the Cholesky space, as in real applications, extremely high-dimensional and sparse data sets are not very commonly seen, while our spike and slab Cholesky prior with the beta-mixture or multiplicative prior is more adaptable and diverse in that aspect.

### 6.2 Simulation II: Illustration of Model Selection

In this section, we perform a simulation experiment to illustrate the potential advantages of using our Bayesian model selection approach. We consider values of ranging from to , with . For each fixed , the Cholesky factor of the true concentration matrix, and the corresponding dataset, are generated by the same mechanism as in Section 6.1. Then, we perform model selection on the Cholesky factor using the four procedures outlined below.

1. Lasso-DAG with quantile based tuning: We implement the Lasso-DAG approach in (Shojaie and Michailidis, 2010) by choosinf penalty parameters (separate for each variable ) given by , where denotes the quantile of the standard normal distribution. This choice is justified in (Shojaie and Michailidis, 2010) based on asymptotic considerations.

2. ESC Metropolis-Hastings algorithm: We implement the Rao-Blackwellized Metropolis-Hastings algorithm for the ESC prior introduced in (Lee, Lee, and Lin, 2018) for exploring the space of the Cholesky factor. The hyperparameters and the initial states are taken as suggested in (Lee et al., 2018). Each MCMC chain for each row of the Cholesky factor runs for 5000 iterations with a burn-in period of 2000. All the active components in with inclusion probability larger than 0.5 are selected. We would like to point out that since the Metropolis-Hastings algorithm needs to be executed for each row of , the procedure could be extremely time consuming, especially in higher dimensions.

3. DAG-Wishart log-score path search: The hierarchical DAG-Wishart prior (Cao et al., 2019) also gives us the closed form to calculate the marginal posterior up to a constant. In particular,

 π(D|Y)=π(D)π(Y)(√2π)nzD(U+nS,n+α(D))zD(U,α(D)),

where is the normalized constant in the DAG-Wishart distrbution and

 π(D)=∏(i,j):1≤i

with , where . Follow the simulation procedures in previous work (Cao et al., 2019). We set the hyperparameters as and for and generate candidate graphs by thresholding the modified Cholesky factor of ( is the sample covariance matrix) on a grid from 0.1 to 0.5 by 0.0001 to get a sequence of graphs. The log posterior probabilities are computed for all candidate graphs, and the graph with the highest probability is chosen. As we discussed previously, we will see in Figure 2 that for the previous DAG-Wishart model, we always end up choosing the most sparse estimator, since the graph obtained at the thresholding value 0.5 always has the highest log posterior score. Hence, we observe that the choice though could guarantee the model selection consistency, makes the posterior stuck in very small size models and we are not able to detect the true model.

4. Spike and slab Cholesky with beta-mixture prior/multiplicative prior: For our Bayesian approach with spike and slab Cholesky prior and beta-mixture/multiplicative prior on the sparsity pattern of , we adopt the similar procedure as DAG-Wishart log-score path search method. We construct two candidate sets as follows.

1. All the Cholesky factors with respect to the graphs on the solution paths for Lasso-DAG, CSCS and DAG-Wishart are included in our Cholesky factor candidate set.

2. To increase the search range, we also generate additional graphs by thresholding the modified Cholesky factor of ( is the sample covariance matrix) on a grid from 0.1 to 0.5 by 0.0001 to get a sequence of additional Cholesky factors, and include them in the candidate set. We then search around all the above candidates using Shotgun Stochastic Search Algorithm in (Shin et al., 2018) to generate even more candidate Cholesky factors. In particular, the authors in (Shin et al., 2018) claim that the simplified algorithm can significantly lessen the simulation runtime and increase the model selection performance.

The log posterior probabilities are computed for all Cholesky factors in the candidate sets using (5.1), and the one with the highest probability is chosen. In Figure 2, we plot the log of marginal posterior densities under the spike and slab Cholesky prior and the multiplicative/beta-mixture prior for all the Cholesky factors under different thresholding values compared with the marginal posteriors under previous DAG-Wishart model. Unlike the DAG-Wishart distribution always favor the most sparse Cholesky factor corresponding to the largest thresholding value, we observe the maximum log posterior score occurs in the middle of the curve for our proposed models, which leads to the significant improvement of the model selection results shown in Table 1 and Table 2.

The model selection performance of these four methods is then compared using several different measures of structure such as positive predictive value, true positive rate and mathews correlation coefficient (average over independent repetitions). Positive Predictive Value (PPV) represents the proportion of true non-zero entries among all the entries detected by the given procedure, True Positive Rate (TPR) measures the proportion of true non-zero entries detected by the given procedure among all the non-zero entries from the true model. PPV and TPR are defined as

 PPV=TPTP + FP,TPR=TPTP + FN.

Mathews correlation Coefficient (MCC) is commonly used to assess the performance of binary classification methods and is defined as

 MCC=TP×TN−FP×FN√(FP + TN)×(TP+FN)×(TN+FP)(TN+FN),

where TP, TN, FP and FN correspond to true positive, true negative, false positive and false negative, respectively. Note that the value of MCC ranges from -1 to 1 with larger values corresponding to better fits (-1 and 1 represent worst and best fits, respectively). Similar to MCC, one would also like the PPV and TPR values to be as close to as possible. The results are provided in Table 1 and Table 2, corresponding to different true sparsity levels. In Figure 4, we draw the heatmap comparison between the true and estimated using our Bayesian spike and slab Cholesky approach under two different sparsity levels when .

It is clear that our hierarchical fully Bayesian approach with beta-mixture prior and multiplicative prior outperforms the penalized likelihood approaches, the Bayesian DAG-Wishart and ESC approach based on almost all measures. The PPV values for our Bayesian spike and slab Cholesky approach are all above , while the ones for the penalized likelihood approach and ESC are below . Though the PPV for the DAG-Wishart approach is almost 1, it is actually a consequence of the maximized log score occurring at the most sparse model. Hence, The precision (PPV) for the DAG-Wishart method is rather high, as the resulting is extremely sparse and all the remaining non-zero entries are the true elements in . The TPR values for the proposed approaches are almost all beyond , while the ones for the penalized likelihood approaches are all below . Now again under this measure, as a result of the final sparse estimator, DAG-Wishart Bayesian approach performs very poorly compared to the spike and slab approach with beta-mixture/multiplicative prior. For the most comprehensive measure of MCC, our fully Bayesian approach outperforms all the other three methods under all the cases of and two different sparsity levels.

It is also meaningful to compare the computational runtime between different methods. In Figure 3, we plot the run time comparison between our spike and slab Cholesky with beta-mixture prior/multiplicative prior and ESC. Since the marginal posterior is available in closed form (up to a constant) for the SSC with beta-mixture prior, we can see that the run time for SSC-B via thresholding coupled with stochastic search is significantly lessened compared to the MCMC approach. The computational cost of ESC is extremely expensive in the sense that it requires not only additional run time, but also larger memory (more than 30GB when ). On the other hand, for the multiplicative prior, though the model selection performance is almost the best among all the competitors, with the extra step of the Laplace approximation for calculating each posterior probability, the computational burden is quite extensive as increases.

Overall, this experiment illustrates that the proposed hierarchical fully Bayesian approach with our spike and slab Cholesky prior and the beta-mixture prior can be used for a broader yet computationally feasible model search, while our spike and slab Cholesky prior with the multiplicative prior though more computationally expensive, can lead to a much more significant improvement in model selection performance for estimating the sparsity pattern of the Cholesky factor and the underlying DAG.

## 7 Proofs

In this section, we take on the task of proving our main results presented in Theorems 4.1 to 5.3.

### 7.1 Proof of Theorem 4.1

The proof of Theorem 4.1 will be broken into several steps. We begin our strong selection consistency proof by first proving the Lemma 3.1 and Lemma 3.2 which give the upper bound for the prior ratio between any “non-true” model and the true model .

###### Proof of Lemma 3.1.

First note that following from model (9) and (10), we have

 π(Z)= ∫p∏j=1π(ωj)π(Z|ω1,…,ωp)dω1…dωp = ∫∏1≤j