Covariance Structure Estimation with Laplace Approximation

Gaussian covariance graph model is a popular model in revealing underlying dependency structures among random variables. A Bayesian approach to the estimation of covariance structures uses priors that force zeros on some off-diagonal entries of covariance matrices and put a positive definite constraint on matrices. In this paper, we consider a spike and slab prior on off-diagonal entries, which uses a mixture of point-mass and normal distribution. The point-mass naturally introduces sparsity to covariance structures so that the resulting posterior from this prior renders covariance structure learning. Under this prior, we calculate posterior model probabilities of covariance structures using Laplace approximation. We show that the error due to Laplace approximation becomes asymptotically marginal at some rate depending on the posterior convergence rate of covariance matrix under the Frobenius norm. With the approximated posterior model probabilities, we propose a new framework for estimating a covariance structure. Since the Laplace approximation is done around the mode of conditional posterior of covariance matrix, which cannot be obtained in the closed form, we propose a block coordinate descent algorithm to find the mode and show that the covariance matrix can be estimated using this algorithm once the structure is chosen. Through a simulation study based on five numerical models, we show that the proposed method outperforms graphical lasso and sample covariance matrix in terms of root mean squared error, max norm, spectral norm, specificity, and sensitivity. Also, the advantage of the proposed method is demonstrated in terms of accuracy compared to our competitors when it is applied to linear discriminant analysis (LDA) classification to breast cancer diagnostic dataset.

Authors

• 1 publication
• 17 publications
01/12/2021

The Beta-Mixture Shrinkage Prior for Sparse Covariances with Posterior Minimax Rates

Statistical inference for sparse covariance matrices is crucial to revea...
08/13/2020

Linear pooling of sample covariance matrices

We consider covariance matrix estimation in a setting, where there are m...
01/14/2020

Sparse Covariance Estimation in Logit Mixture Models

This paper introduces a new data-driven methodology for estimating spars...
08/22/2018

Bayesian Estimation of Sparse Spiked Covariance Matrices in High Dimensions

We propose a Bayesian methodology for estimating spiked covariance matri...
07/02/2021

The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models

Here, we investigate whether (and how) experimental design could aid in ...
10/17/2020

Random Matrix Based Extended Target Tracking with Orientation: A New Model and Inference

In this study, we propose a novel extended target tracking algorithm whi...
07/09/2018

A partial orthogonalization method for simulating covariance and concentration graph matrices

Structure learning methods for covariance and concentration graphs are o...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The sparse covariance matrix estimation problem based on a high-dimensional dataset, where the sample size exceeds the dimension of variable , has been grown in importance in multivariate data analysis since the problem is crucial to uncover underlying dependency structures among

-dimensional variables. The sparse covariance matrix estimation plays a key role in many multivariate statistical inferences, such as principal component analysis (PCA), linear discriminant analysis (LDA), and time series analysis. In the estimation, Gaussian covariance graph model provides an excellent tool, assuming multivariate normal distribution on data. Under the normality, the covariance matrix induces a bi-directed graph and the absence of an edge between two variables is equivalent to zero on the covariance between them. Hence, inference of covariance structure, or equivalently graphical structure, can lead to the estimation of covariance matrix. But the inference requires introducing sparsity to the structure, as the structure is determined by the zeros in the covariance matrix.

In the frequentist literature, introducing sparsity to the structure has been mostly done using regularization methods. The purpose of bien2011sparse and yuan2007 considered -type penalty on the negative log-likelihood. They introduced sparsity to the structure of the model using sparse and shrinkage estimators. On the other hand, bickel2008covariance, rothman2009generalized, and cai2011adaptive considered thresholding method. However, the estimated covariance matrices from these methods sometimes contradicted with a positive definite constraint on covariance matrices. Other works in the frequentist context include banding and tapering, e.g., wu2003, bickel2008regularization, and kauf2012. Their estimators are often used for Gaussian covariance graph model and are well supported by the asymptotic statistical properties.

The Bayesian methods for introducing sparsity to the covariance matrix have been also developed. The G-inverse Wishart prior, considered by silva2009hidden, was often used in the Bayesian framework. However, the method had a limitation in applications as the dimension gets higher because of posterior intractability. khare2011wishart extended G-inverse Wishart prior to a broader class and proposed a blocked Gibbs sampler to sample covariance matrices from the resulting posterior, but khare2011wishart considered only decomposable graphs. wang15

also considered covariance graph model for the inference of covariance matrices and also provided a blocked Gibbs sampler that is applicable to all graphs but under the continuous spike and slab prior, which uses the mixture of two Gaussian distributions on off-diagonal entries, one with sufficiently small variance and the other with variance substantially far from

, and the exponential prior on diagonal entries. But the method proposed by wang15 had inherent difficulty in introducing sparsity to the structure of the model due to the absolute continuity of prior. Furthermore, none of silva2009hidden, khare2011wishart, and wang15 did not attain any asymptotic statistical properties. lee2021 were able to introduce sparsity to the covariance structure using beta-mixture shrinkage prior and attained the posterior convergence rate of covariance matrix under the Frobenius norm. However, compared to frequentist literature, Bayesian literature still lacks in the methods for inference of sparse covariance matrices, and most of the proposed methods are not theoretically well supported. Furthermore, they have a limit in introducing sparsity to the covariance structures due to the absolute continuity of prior.

To fill the gap in the literature of Bayesian inference, we propose a method for estimating covariance structures under spike and slab prior which uses a mixture of point-mass and normal distribution and exponential prior on off-diagonal entries and diagonal entries of covariance matrices, respectively, which is a modified version of the prior in

wang15

. To overcome the limitations of the current Bayesian inference on sparse covariance matrices we have described, we propose a method that uses Laplace approximation to compute posterior probabilities of covariance structures, generates MCMC samples of graphs using Metropolis-Hastings algorithm proposed by

liu2019empirical and chooses the model either by median probability model (MPM) or maximum a posteriori (MAP). We estimate covariance matrix by the model of conditional posterior of covariance matrix given the structure.

Because of the enforced zero due to point-mass prior, a blocked Gibbs sampler for sampling covariance matrix from the resulting posterior or reversible jump MCMC (RJMCMC) for computing posterior model probabilities of covariance structures is not suitable in this case. To be specific, the enforced zero due to point-mass makes it difficult to derive conditional posteriors induced from the proposed prior if one considers the blocked Gibbs sampler proposed by wang15, hence bthe locked Gibbs sampler is not applicable to this case. Furthermore, if point-mass is introduced, there are models that RJMCMC has to visit over, which makes practical implementation extremely difficult as grows and the estimated posterior model probability of the covariance structure unreliable. We show that the error by Laplace approximation becomes asymptotically marginal at some rate depending on the posterior convergence rate of covariance matrix under the Frobenius norm.

One of the advantages of using the suggested prior in this paper is that the posterior is always twice continuously differentiable. In the estimation of sparse precision matrices, banerjee2015bayesian proposed the Bayesian version of graphical lasso, which uses Laplace approximation to compute posterior model probabilities of graphical structures and chooses the final model by MPM. banerjee2015bayesian considered Laplace prior on off-diagonal entries. Since Laplace prior is not differentiable at the median, Laplace approximation was applicable to only regular models. The term regularity may follow the terminology used by yuan2005 and banerjee2015bayesian. yuan2005 and banerjee2015bayesian put -type penalty on the entries of the concentration matrix, which is the graphical lasso. When an off-diagonal entry is set as zero due to graphical lasso, the integrand in Laplace approximation becomes non-differentiable. If the model includes such an off-diagonal entry as a free variable, yuan2005 and banerjee2015bayesian referred to such model as a non-regular model, and a regular model otherwise. banerjee2015bayesian showed that the posterior probability of regular models is not smaller than that of their non-regular counterparts so that one can consider only regular models if one is to choose a model by MPM. However, one had to determine the regularity of each model and it turned out that the ratio of regular models among all possible models is extremely low. Moreover, the constraint regularity led the estimated matrix to be extremely sparse. On the contrary, since the posterior induced from the prior in this paper is always twice continuously differentiable, we do not have to consider the regularity. Thus, we do not have to suffer from the drawback of banerjee2015bayesian we just mentioned. Also, even though banerjee2015bayesian put a limit on the number of edges of a graph by Frequentist graphical lasso, there were still too many models to search over and one had to judge the regularity of each model which yields inefficiency in choosing the model. However, by using the algorithm in liu2019empirical, we do not have to search over all models. Furthermore, since we force zeros on off-diagonal entries, sparsity can be naturally introduced to the structure of covariance matrix compared to those with continuous spike and slab prior in wang15 and lee2021. Hence, our prior is more useful in introducing sparsity to the covariance structure.

The paper is organized as follows. In section 2, we introduce notations and preliminaries necessary for this paper and the prior considered in this paper. Then, in section 3, we describe Laplace approximation to calculate posterior model probabilities of covariance structures and Metropolis-Hastings algorithm to generate MCMC samples of graphs from the resulting approximated posterior. We choose the final model from the MCMC samples either by MPM and MAP. Also, we show that the error by Laplace approximation becomes asymptotically negligible at some rate depending on the posterior convergence rate of covariance matrix and propose a block descent algorithm to find the mode of conditional posterior of covariance matrix as the Laplace approximation is done around it, which cannot be obtained in the closed form. We describe how to estimate the covariance matrix when the covariance structure is given using this algorithm. Finally, in section 4, we provide the simulation results for five numerical models and breast cancer diagnostic dataset. The conclusion will be discussed in 5.

2 Prior and Posterior Convergence Rate

2.1 Notations and Preliminaries

Consider the graph where is the set of nodes and is the set of edges. Let be a

edge inclusion vector, or equivalently a covairance (graphical) structure indicator, i.e.,

if or and otherwise, for . Note that in this paper we consider bi-directed graphs, especially Gaussian covariance graph model, and thus if and only if for all with . Denote the sum of entries in by , the number of edges in . Suppose two positive numerical sequences and are given. If is bounded as , we denote this by , or equivalently . If and both hold, we write .

Let be the set of all real symmetric matrices. We denote the set of all positive definite matrices by . Suppose and . Denote the Hadamard product of and by and the Kronecker product by . Let and

denote the smallest and the largest eigenvalues of

, respectively. Also, let , , and denote the spectral norm by . Note the following facts hold:

 ||A||∞≤||A||2≤||A||F≤p||A||∞, (1) ||AB||F≤||A||2||B||F. (2)

We write () if .

2.2 Prior Setting

Suppose we observe independent random samples from , where and is sparse, i.e., many of are zeros. For Bayesian inference on the sparsity of in this paper, we consider a spike and slab prior on off-diagonal entries, which uses a mixture of point-mass and Gaussian distribution, and exponential prior on diagonal entries as follows:

 πu(σij) =(1−q)δ0+qN(σij|0,v2),1≤i

where is some positive constant substantially far from , , , and is point-mass. Here we consider

to be the rate parameter of the exponential distribution. By (3) and (4), the prior on the entries of

is defined as

 πu(Σ)=∏i

The prior (5) can be equivalently defined through a hierarchical model with an edge inclusion vector ,

 πu(Σ|Z) =∏zij=1N(σij|0,v2)p∏i=1Exp(σii|λ/2), πu(Z) =∏i

Thus, one can interpret prior (3) as if with probability and if with probability for . Hence, can be seen as a covariance structure indicator, or equivalently a graphical structure indicator, following the terminology used by banerjee2015bayesian. Note is the edge acceptance probability; and the larger is, the less model becomes sparse. We may choose not too small and to avoid the extreme sparsity of the model. Compared to the prior in wang15, our prior naturally introduces sparsity to the structure of model with the point-mass at zero. Define the set

 U(τ)={C∈M+:1/τ≤λmin(C)≤λmax(C)≤τ}, (6)

where . We restrict on (6) and so consequently we have the following prior

 π(Σ)∝πu(Σ)1(Σ∈U(τ)). (7)

Note that this restriction was frequently used in many statistical inferences on sparse covariance matrices. The restriction is useful in deriving asymptotic statistical properties for our proposed method under regular conditions, though we consider as in a practical implementation.

Compared to prior in wang12 or banerjee2015bayesian, which considered Laplace prior that is not differentiable at its median on off-diagonal entries, prior in (7) is always twice continuously differentiable. Thus, if we are to compute the posterior model probabilities of using Laplace approximation, prior on off-diagonal entries in (7) takes the advantage over Laplace prior because it is twice continuously differentiable and we do not need to check the regularity of the model.

3 Posterior Computation

3.1 Posterior

Suppose follows and follows prior (7). The likelihood of is given as follows.

 π(Xn|Σ)=n∏i=11|2πΣ|1/2exp(−12XtiΣ−1Xi)∝n∏i=1|Σ|−1/2exp(−12tr(XtiΣ−1Xi))=n∏i=1exp(−12log|Σ|−12tr(XiXtiΣ−1))=exp(−n2log|Σ|−12tr(n∑i=1XiXtiΣ−1))=exp(−n2log|Σ|−n2tr(SΣ−1)), (8)

where , which is the sample covariance matrix. Note

 π(Σ|Z)πu(Z) ∝(1−q)p(p−1)/2−#Z∏zij=1qN(σij|0,v2)p∏i=1Exp(σii|λ2)1(Σ∈UZ(τ)) ∝(q1−q1√2πv)#Zexp⎛⎝−12v2∑zij=1σ2ij−λ2p∑i=1σii⎞⎠1(Σ∈UZ(τ)) =(q1−q1√2πv)#Zexp(−n2p(Σ,Z))1(Σ∈UZ(τ)), (9)

where and

 UZ(τ)={Σ=(σij)∈U(τ):σij=0 if zij=0}.

Now, we obtain the marginal posterior of

under prior (7). By Bayes’ rule, together with (8) and (9), we have the following conditional joint probability density function of

.

 π(Z,Σ|Xn) ∝π(Xn|Σ,Z)π(Σ|Z)πu(Z) ∝(q1−q1√2πv)#Zexp(−n2log|Σ|−n2tr(SΣ−1)−n2p(Σ,Z))1(Σ∈UZ(τ)) =(q1−q1√2πv)#Zexp(−n2rZ(Σ,Xn))1(Σ∈UZ(τ)), (10)

where . Let . For the notational simplicity, denote by . By (10), we have

 π(Z|Xn)∝∫ΣZ∈UZ(τ)π(Z,ΣZ|Xn)dΣZ∝(q1−q1√2πv)#Z∫ΣZ∈UZ(τ)exp(−n2rZ(ΣZ,Xn))dΣZ. (11)

Note that the posterior of is very intractable. But as is twice continuously differentiable and has a minimizer on the domain , if we choose so that is broad enough to contain the minimizer, we approximate using Laplace approximation in Section 3.2.

3.2 Laplace Approximation

In this section, we approximate in (11) using Laplace approximation. Suppose is the minimizer of on , where and is given. Then, the Laplace approximation is done around . So we have to find first. Observe that is the solution of optimization problem (12).

 MinimizeΣZ=(σZ,ij)∈UZ(τ)log|ΣZ|+tr(SΣ−1Z)+1nv2∑zij=1σ2Z,ij+λnp∑i=1σZ,ii (12)

Objective function in (12) can be seen as a regularized negative log likelihood by putting -type penalty on off-diagonal entries and -type penalty on diagonal entries. Note that is not convex on and cannot be obtained in the closed form. Hence, the optimization problem (12) can be reduced to convex optimization problem if we consider the set instead of in the optimization problem (12).

Observe that is convex on the set . Provided that belongs to the set , solving the reduced optimization problem is more desired, since the convexity makes it easier to deriving algorithm for finding the solution of the optimization problem (12). We have to resort to some numerical algorithm, as cannot be obtained in the closed form. If such algorithm converges to the stationary point of the reduced optimization problem, we obtain a local minimum of but the local minimum can be global minimum due to the convexity.

But if depends on , does not necessarily hold. Note that we assume the dependency between and to derive asymptotic statistical properties for our proposed method in this paper, which are to be discussed in Section 3.3. Thus, we pose assumptions on parameters so that . Consider the following assumptions :

1. for some constant .

2. is some constant, , , and .

Assuming (A1) and (A2), we show that with probability tending to one. For simplicity, we may consider when , since the similar argument can be applied to general , where is a vector with all entries being . Then, must satisfy the normal equation

 −Σ−1Z+Σ−1ZSΣ−1Z+Λ∘ΣZ+λnIp=0p×p,

or equivalently,

 −ΣZ+ΣZSΣZ+ΣZ(Λ∘ΣZ)ΣZ+λnΣ2Z=0p×p, (13)

where is the matrix with 0 on diagonal entries and on off-diagonal entries and and

is identity matrix and zero matrix, respectively. Since

and ,

 ||λnΣ2Z||2≤λn||ΣZ||22≤λnτ2→0, (14)

as . Also,

 ||ΣZ(Λ∘ΣZ)ΣZ||2≤||ΣZ||22||Λ∘ΣZ||2≤pτ2||Λ∘ΣZ||∞≤pτ2maxzij=1{1/nv2⋅|σZ,ij|}≤τ3pnv2→0, (15)

as . Here we used (1) in the second inequality and note that the last inequality holds by (A1) and . Since the equation (13) is equivalent to equation (16)

 2S−ΣZ=S−[ΣZ(Λ∘ΣZ)ΣZ+λnΣ2Z], (16)

(14) and (15) imply that belongs to for all sufficiently large .

Thus, assuming (A1) and (A2), the optimization problem (12) can be reduced to the following optimization problem (17)

 MinimizeΣZ=(σZ,ij)∈QZ(τ)log|ΣZ|+tr(SΣ−1Z)+1nv2∑zij=1σ2Z,ij+λnp∑i=1σZ,ii, (17)

where , which is a convex optimization problem. We consider the optimization problem (17) instead of (12).

Now it remains to solve (17). To solve (17) as to obtain , we resort to a numerical algorithm that solves (17) as cannot be obtained in the closed form. We provide a block coordinate descent algorithm for solving (17) in Section 3.4.

Provided that is in the hand, we apply Laplace approximation to (11). Let , where . Then we have

 rZ(ΣZ,Xn)=rZ(Σ∗Z,Xn)+kZ(ΔZ,Xn)−log|Σ∗Z|−tr(SΩ∗Z), (18)

where and . Substituting (18) into (11),

 π(Z|Xn) ∝(π(1−π)v1√2π)#Zexp(−n2rZ(Σ∗Z,Xn))|Σ∗Z|n2exp(n2tr(SΩ∗Z)) (19)

where . Note that is uniquely minimized at . Hence we have Hessian matrix of at , where , and apply Laplace approximation to the integral in (19). Let . Suppressing dependency on , write as and as for simplicity. By the supplementary material, we see that

 (20)

By (20), we approximate by as follows.

 π∗(Z|Xn)∝(π(1−π)v1√2π)#Zexp(−n2rZ(Σ∗Z,Xn))|Σ∗Z|n2exp(n2tr(SΩ∗Z))exp(−n2kZ(0p×p,Xn))(4πn)(p+#Z)/2∣∣HΣ∗Z∣∣−12=(π(1−π)v1√2π)#Zexp(−n2rZ(Σ∗Z,Xn))(4πn)(p+#Z)/2∣∣HΣ∗Z∣∣−12. (21)

Using this approximated posterior model probability of , which is a graphical structure indicator, we use Metropolis-Hastings algorithm proposed by liu2019empirical to generate MCMC samples of . liu2019empirical considered a symmetric proposal distribution , which samples uniformly from that differs from in only one entry. The specific description of can be found in Section 5.1 of liu2019empirical. This gives the following Metropolis-Hastings algorithm to generate MCMC samples of .

Here is defined in (21). We choose the final model by either MPM or MAP. Suppose is chosen as the final model. With and given, by Bayes’ rule, the conditional posterior of is given as following:

 π(Σ|Xn,~Z) ∝exp(−n2log|Σ|−n2tr(SΣ−1)−n2p(Σ,~Z))1(Σ∈U~Z(τ)) =exp(−n2r~Z(Σ,Xn))1(Σ∈U~Z(τ)).

Here and are defined in (9) and (10). In this paper, we estimate by the mode of . Note that the mode of is equivalent to the solution of the optimization problem (17) with . Thus, the mode of can be found using block coordinate descent algorithm provided in Section 3.4.

3.3 Error by Laplace Approximation

In this section, we show that the error by Laplace approximation becomes asymptotically marginal as in (21) under regular conditions. Note that this can established if we show that

 π(Z|Xn)/π∗(Z|Xn)→1 (22)

with probability tending to one in probability.

Denote the true covariance matrix by . Suppose that the number of nonzero off-diagonal entries in is controlled by a positive integer , where . For a covariance matrix , let be the number of edges in the covariance graph induced by . Define the set

 U(s0,τ0)={Σ∈M+:s(Σ)≤s0,1/τ0≤λmin(Σ)≤λmax(Σ)≤τ0},

where is some constant. With these notations, in addition to the assumptions (A1) and (A2), we pose additional assumptions on parameters and :

1. .

2. is some constant, , , and .

To explain (A3), the assumption (A3) was first considered in bickel2008covariance and often used in statistical inference of sparse covariances for either Frequentist literature or Bayesian literature. The positive integer controls the sparsity of true covariance matrix and the upper bound on eigenvalue of , , together with , is often used in converting to . (A4) is another assumption on parameters to attain posterior convergence rate of covariance matrix under the Frobenius norm, which turns out to be .

Note that the result of posterior convergence rate is necessary to establish , which is crucial to prove (22), where . This implies that the posterior and true covariance matrix are concentrated around the projection of true covariance matrix onto the model at rate similar to remark of (3.15) in banerjee2015bayesian. A similar argument was also made in banerjee2015bayesian. Using auxiliary results in the supplementary material, the error due to Laplace approximation tends to zero with probability tending to one if under regular conditions. This gives Theorem 3.1.

Theorem 3.1.

Assume that in (A1), (A2)-(A4). Also, suppose that , , , , and . Further suppose that , where and . Then, under prior (7),

 π(Z|Xn)/π∗(Z|Xn)→1

in probability, which implies that the error by Laplace approximation in (21) tends to zero in probability.

The proof is provided in the supplementary material. The proof uses the techniques considered by rothman2008 and banerjee2015bayesian. Since the proof uses the result of posterior convergence rate, it remains to show that is the posterior convergence rate of the covariance matrix under the Frobenius norm. Under prior (7), this follows from Theorem 3.2.

Theorem 3.2.

Let be the random sample from and consider the prior (7). Assume (A1)-(A4) and suppose that , , , , and . If ,

 π(||Σ−Σ0||F≥Mϵn|Xn)→0,

for some constant in -probability.

3.4 Block Coordinate Descent Algorithm

In this section, we propose a block coordinate descent algorithm that solves (17). For convenience, write as suppressing dependency on . Partition and as follows.

 (23)

where is a matrix and is scalar. Let and . Note that by the constraint , some entries of are fixed as 0. For simplicity, write , where is vector with entries being corresponding to , by rearranging rows/columns of and in (23). Since should be fixed due to , we update only in . With fixed , neglecting terms that do not depend on or in (17), we have

 Minimize(β1,γ)logγ+βt1[Σ−111S11Σ−111]1β1−2βt1[Σ−111s12]1+s22γ+βt1Θβ1+λn(γ+βt1[Σ−111]1β1), (24)

where and denote some principal minor matrix of and , respectively, is subvector of , and . Let be the objective function in (24). One can see that is quite similar to (4) in wang14, except for the term . Neglecting terms that do not depend on in (24), (24) is reduced to

 Minimizeγlogγ+uγ+λnγ, (25)

where . Note that with probability tending to one, which can be shown in Proposition 3.1.

Proposition 3.1.

Let as in (24). Then with probability tending to one.

Substituting and in (5) of wang14, one can see that the solution of (25), denoted by , is

 ^γ=−1+√1+4uρ2ρ. (26)

Note that the positivity of ensures the existence of (26). Now let . Note depends on and is a coordinate independent of . Neglecting terms that do depend on in , we have

 l(ξ)=ξt(Θ+λn[Σ−111]1+[Σ−111S11Σ−111]1/^γ)ξ−2ξt[Σ−1s12]1/^γ.

Since is convex, the minimizer can be found by solving . So,

 ∂l∂ξ =2(Θ+λ