 # Graphical Exponential Screening

In high dimensions we propose and analyze an aggregation estimator of the precision matrix for Gaussian graphical models. This estimator, called graphical Exponential Screening (gES), linearly combines a suitable set of individual estimators with different underlying graphs, and balances the estimation error and sparsity. We study the risk of this aggregation estimator and show that it is comparable to that of the best estimator based on a single graph, chosen by an oracle. Numerical performance of our method is investigated using both simulated and real datasets, in comparison with some state-of-art estimation procedures.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Graphical models [8, 22] have become as a useful way of exploring and modeling the distribution. For instance, graphical models could be used to represent complex interactions among gene products resulted from biological processes. Such problems require us to infer an undirected graph from i.i.d. observations.

Let

be a random vector with some continuous distribution. An undirected graph

has vertices, collected in a set , one for each variable. We represent the edges as a set of unordered pairs: if and only if there is an edge between and . An edge between and is absent if and are independent, given the other variables.

The default model for graphical modeling of continuous data is the multivariate Gaussian. Let data be the realizations of

independent samples from a multivariate Gaussian distribution

, where is the covariance matrix. Then the log-likelihood of is (up to a constant) given by

 ln(Θ):=n2logdet(Θ)−n2tr(ˆΣnΘ), (1)

where is the precision matrix, i.e. inverse covariance matrix, and is the empirical covariance matrix.

For Gaussian graphical models, it is well known that the edge between the th and th nodes is absent in the graph, meaning that the associated variables are conditionally independent given the other variables, if and only if , where is the th element of . Therefore, the estimation and model selection problems in Gaussian graphical models are equivalent to estimation of the precision matrix and identification of its zero-pattern.

While it is one of the classical problems in multivariate statistics, with a renewed focus on high-dimensional data in recent years, a number of sparse estimators have been proposed to deal with the problem of precision matrix estimation. Among them,

Yuan and Lin  and Friedman et al.  impose an penalty on the entries of the precision matrix when maximizing the Gaussian log-likelihood, known as the graphical lasso, encouraging some of the entries of the estimated precision matrix to be exactly zero. Meinshausen and Bühlmann  consider the neighborhood selection method via the lasso. Cai et al.  propose a constrained minimization approach for sparse inverse covariance matrix estimation. Yuan 

takes advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, developing an estimating procedure that can effectively exploit sparsity. Theoretical properties, including consistency in parameter estimation and sparsity structure recovery, are discussed in these and other papers

[15, 18].

Given a collected family of estimators, linear or convex aggregation methods are another class of technique to address model selection problems and provide flexible ways to combine various models into a single estimator . The idea of aggregating estimators was originally described in . The suggestion put forward by  is to achieve two independent subsamples from the original sample by randomization: individual estimators are constructed from the first subsample while the second is used to perform aggregation on those individual estimators. This idea of two-step procedures carries over to models with i.i.d. observations where one can do sample splitting. Along with this method, one might aggregate estimators using the same observations for both estimation and aggregation. However, this would generally result in overfitting.

A primary motivation for aggregating estimators is that it can improve the estimation risk, as “betting” on multiple models can provide a type of insurance against a single model being poor . Most of the recent work on estimator aggregation deals with regression learning problems. For example, Exponential Screening (ES) for linear models provides a form of frequentist averaging over a large model class, which enjoys strong theoretical properties 

. Also, an aggregation classifier is proposed in

 and an optimal rate of convex aggregation for the hinge risk is also obtained.

In this paper, we propose a new estimating procedure by considering a linear combination of a suitable set of individual estimators with different sparsity patterns. A sparsity pattern is defined as a binary vector with each element indicating whether the corresponding edge of the graph is absent or not. These individual estimators and the corresponding aggregating weights are determined to ensure a competitive rate of convergence for the risk of the aggregation estimator.

Our aggregation method is based on a sample-splitting procedure: the first subsample is set to construct individual estimators and the second subsample is then used to determine the weights and aggregate these estimators. To carry out the analysis of the aggregation step, it is enough to work conditionally on the first subsample so that the problem reduces to aggregation of deterministic estimators . Namely, let denote a risk function, then given the deterministic estimator with sparsity pattern , one can construct an aggregation estimator such that the excess risk

 r:=R(ˆΘ)−minm∈MR(Θm), (2)

is as small as possible, where

is a candidate set of sparsity patterns. Ideally, we wish to find an aggregation estimator whose risk is as close as possible (in a probabilistic sense) to the minimum risk of individual estimators. The risk function considered in the paper is the Kullback-Leibler divergence.

The rest of the paper is organized as follows. In Section 2, we describe the graphical aggregation in detail. Theoretical properties are provided in Section 3. Numerical experiments are presented in Sections 4. Finally, we conclude in Section 5. Detailed proofs are collected in Appendix section.

## 2 Graphical Exponential Screening Estimator

### 2.1 Notations and preliminaries

Let denote the vector norm, denote the spectral matrix norm, and denote the Frobenius matrix norm. For any symmetric positive-definite matrix we use the notation . For any two real numbers and , we use the notation . Let denote the indicator function.

We call a sparsity pattern any binary vector , where is a candidate set of sparsity patterns and . The th coordinate of can be interpreted as an indicator of presence () or absence () of the edge such that is the th element of the ordered list , where .

We partition the sample into two independent subsamples, and , of size and , respectively, where .

### 2.2 Graphical aggregation method

In the aggregation procedure, is utilized to construct individual estimators and is then used to aggregate these estimators. Given each sparsity pattern , we first define the individual estimator of the precision matrix.

###### Definition 2.1.

For each , let be the edge set of the graph with sparsity pattern . Let be the constrained maximum likelihood estimator, defined by

 ˆΘ(1)m=argmaxΘ∈Cm{logdet(Θ)−\emphtr(ˆΣ(1)n1Θ)}, (3)

where

 Cm ={Θ≻0:θij=0\emph for any (i,j)∉Em\emph and i≠j}, (4)

and denotes the empirical covariance matrix using the first subsample .

Notice that each individual estimator maximizes the log-likelihood under the constraint that some pre-defined subset of the parameters are zero. If where is the maximal clique size of a minimal chordal cover of the graph with edge set , estimator exists and is unique . The following relationships hold regarding and its inverse :

 [ˆΘ(1)m]ij =0,∀(i,j)∉Em, and (5) ij =[ˆΣ(1)n1]ij,∀(i,j)∈Em∪{(i,i),i=1,…,p}. (6)

Indeed we can drive the above properties via the Lagrange form, where we add Lagrange constants for all missing edges of

 ln1,m(Θ):=logdet(Θ)−tr(ˆΣ(1)n1Θ)−∑(i,j)∉Emγijθij, (7)

where is a Lagrange constant. The gradient equation for maximizing (7) can be written as

 Θ−1−ˆΣ(1)n1−Γ=0, (8)

where is a matrix of Lagrange parameters with nonzero values for all pairs with edges absent. It is an equality-constrained convex optimization problem, and a number of methods have been proposed for solving it, for example, in Hastie et al. .

Now we define the aggregation estimator, which linearly combines a set of individual estimators for .

###### Definition 2.2.

Let be the graphical Exponential Screening (gES) estimator that linearly combines a set of individual estimators , defined by

 ˆΘ(1,2)\emphgES=∑m∈Mv(1,2)mˆΘ(1)m, (9)

where the superscripts denote which subsamples are used for constructions, and the weights

 v(1,2)m:=exp{(n2/2)(\emphlogdet(ˆΘ(1)m)−\emphtr(ˆΘ(1)mˆS(2)n2))}πm∑m′∈Mexp{(n2/2)(\emphlogdet(ˆΘ(1)m′)−\emph% tr(ˆΘ(1)m′ˆS(2)n2))}πm′. (10)

Here, is an estimator of , and

is a (prior) probability distribution on the set of sparsity pattern

, defined by

 πm:=1H(|m|1ep(p−1))|m|1, (11)

where is a normalization factor to ensure that add up to one.

Observe that the gES estimator is a linear combination of the set of individual estimators with weights . It is indeed a convex combination since the weights add up to one.

A natural choice of would be the empirical covariance matrix . In this scenario, ignoring the prior distribution , the individual weight is proportional to the likelihood of estimator evaluated on the second subsample. Thus, the higher the predictive ability, the more weighting will be put on the corresponding individual estimator. However, as is shown in the next section, this would lead to a deterioration of convergence rate in high-dimensional settings. Instead, we can use the hard thresholding estimator proposed in Bickel and Levina . To be more specific, the thresholding estimation of thresholded at is defined by

 S(2)n2=Tγ(ˆΣ(2)n2):={ˆσ(2)ij⋅1(|ˆσ(2)ij|≥γ)}1≤i,j≤p, (12)

where is the th element of . In practice, we can apply the following procedure for the problem of threshold selection : we split the second subsample randomly into two pieces of size and , respectively, and repeat this times. Let and be the empirical covariance matrices based on the two pieces, from the th split. Then the thresholding parameter is determined by

 γ=argminγ′{1BB∑v=1∥Tγ′(ˆΣ(2)1,v)−ˆΣ(2)2,v∥2F}. (13)

In Definition 2.2, we also incorporate a deterministic factor into the weighted averaging to account for (prior) model complexity, in a manner that facilitates desirable risk properties [10, 16]. Here, low-complexity models are favored. Alternatively, if the set of sparsity patterns

, the following deterministic factor specifies a uniform distribution on the cardinality of the subset and a conditionally uniform distribution on the subsets of that size:

 πUm=[(p(p−1)/2+1)(p(p−1)/2|m|1)]−1. (14)

In addition, a simple way is to choose a flat prior, where we set for any .

We show in the next section that, under some technical conditions, the risk of the gES estimator is bounded by the risk of the best individual estimator plus a low price for aggregating.

### 2.3 Approximation algorithm

To implement the estimating procedure, note that exact computation of the aggregation estimator might require the calculation of as many as

individual estimators. In many cases this number could be extremely large, and we must make a numerical approximation. Observing that the aggregation estimator is actually the expectation of a random variable that has a probability mass proportional to

on individual estimator for , then the Metropolis-Hastings algorithm can be exploited to provide such an approximation. The detail algorithm is shown in Algorithm 2.1.

###### Algorithm 2.1.

If the set of sparsity patterns , then the gES estimator can be approximated by running a Metropolis-Hastings algorithm on a -dimensional hypercube:

• Initialize ;

• For each , generate with the uniform distribution on the neighbours of ;

• Generate an -uniformly distributed number ;

• Put , if ; otherwise,

• Compute . Stop if ; otherwise, update and go to step (S2).

Then we can approximate by

 ˆˆΘ(1,2)\emphgES=1TT0+T∑t=T0+1ˆΘ(1)mt, (15)

where and are positive integers.

Here, the neighbours of consists of all the sparsity patterns with a Manhattan distance of one to

. The following proposition shows that the resulting Markov chain ensures the ergodicity.

###### Proposition 2.1.

The Markov chain generated by Algorithm 2.1 satisfies

 limT→∞1TT0+T∑t=T0+1ˆΘ(1)mt=∑m∈Mv(1,2)mˆΘ(1)m,almost surely. (16)

The proof is straightforward as the Markov chain is clearly -irreducible.

The Metropolis-Hastings algorithm incorporates a trade-off between prediction and sparsity to decide whether to add or discard an edge. Observe that the gES estimator would always estimate a precision matrix in which all the elements are nonzero, since all possible individual estimators are linearly mixed. However, the Metropolis-Hastings algorithm would lead to a sparse precision matrix estimation as in the regression case .

## 3 Theoretical Properties

In this section, we show that under some technical conditions, the risk of the gES estimator is bounded by the risk of the best individual estimator in the dictionary plus a low aggregating price.

Following the notations in Bickel and Levina , we define the uniformity class of covariance matrices invariant under permutations by

 U(q,δ,M)={Σ∈Rp×p:Σ≻0,σii≤M,p∑j=1|σij|q≤δ,for all i}, (17)

for , where and are constants.

For any estimator , we define the risk function

 R(ˆΘ)=tr(ˆΘΣ)−% logdet(ˆΘ). (18)

Note that is the true covariance matrix. Consider the Kullback-Leibler (KL) divergence

 KL(ˆΘ)=R(ˆΘ)−R(Θ)≥0. (19)

For each individual estimator corresponding to , we assume exists and is unique. The following proposition relates the KL risk of the aggregation estimator to the KL risks of individual estimators .

###### Proposition 3.1.

The gES estimator in Definition 2.2 satisfies the following inequality

 \emphKL(ˆΘ(1,2)\emphgES)≤ minm∈M{\emphKL(ˆΘ(1)m)+2n2log1πm+\emphtr((ˆΘ(1)m−ˆΘ(1,2)\emphgES)(ˆS(2)n2−Σ))} (20) ≤ minm∈M{\emphKL(ˆΘ(1)m)+2n2log1πm+\emphtr(ˆΘ(1)m(ˆS(2)n2−Σ))+∥ˆS(2)n2−Σ∥⋅\emphtr(ˆΘ(1,2)\emphgES). (21)

It is shown in Vershynin  that under some conditions

 ∥ˆΣ(2)n2−Σ∥=OP(√pn2), (22)

thus if we use the empirical covariance matrix as , Proposition 3.1 implies that this would lead to a deterioration of convergence rate in high dimensions. Then we choose to use the thresholded covariance matrix estimation in Bickel and Levina .

We consider the following assumption for further analysis of the remainder term.

###### Assumption 3.1.

The set of sparsity patterns satisfies the following condition

 maxm∈M\emphtr(ˆΘ(1)m)=OP(p). (23)

This assumption ensures a fast convergence rate of the aggregation estimator. It is shown in Dahl et al.  that the inverse of is a solution to the following problem that maximizes the determinant of a symmetric positive definite matrix :

 maxZ≻0 logdetZ, (24) s.t. Zij=[ˆΣ(1)n1]ij,∀(i,j)∈Em∪{(i,i),i=1,…,p}.

In general, note that for any two graphs and with sparsity patterns and , respectively, if is a subgraph of , then the trace of would always be larger than that of . Thus the most dense graph in the dictionary would always be able to achieve the maximum trace among all individual estimators. This assumption claims that the diagonal entries of the true precision matrix are well estimated by all individual estimators in the dictionary .

Let be the sparsity pattern that attains the minimum KL risk in the dictionary :

 m∗=argmin m∈MKL(ˆΘ(1)m). (25)

The following theorem shows the oracle inequality that the aggregation estimator satisfies.

###### Theorem 3.1.

Suppose Assumption 3.1 hold. Uniformly on , for sufficiently large , if the thresholding parameter , and , then the gES estimator satisfies

 \emphKL(ˆΘ(1,2)\emphgES)−minm∈M\emphKL(ˆΘ(1)m)=OP(s∗logpn2+p(logpn2)(1−q)/2), (26)

where is the number of nonzero off-diagonal elements of .

This theorem yields a rate of convergence of the excess risk. In particular, if the set of sparsity patterns also includes the true sparsity pattern . Let be the number of nonzero off-diagonal elements of . It is shown in Zhou et al.  that under some technical conditions

 KL(ˆΘ(1)m0)=OP((s0+p)logpn2). (27)

Combine it with Theorem 3.1 and assume that , we obtain

 KL(ˆΘ(1,2)gES)=OP(p(logpn2)(1−q)/2), (28)

for .

## 4 Experimental Results

In this section, we provide empirical evidence to illustrate the usefulness of the proposed gES estimator and compare it with other state-of-the-art methods in parameter estimation and graph recovery using simulated and real datasets.

### 4.1 Numerical simulations

We generate synthetic datasets with sample size or , and number of nodes or . We use the following three models for simulating graphs and precision matrices. Figure 1 displays a typical run of the generated graphs of the precision matrices when .

• “AR”: The off-diagonal th element of the adjacency matrix is set to be 1 if and 0 otherwise;

• “Hub”: The vertices are evenly partitioned into disjoint groups. Each group is associated with a center vertex in that group and the off-diagonal th element of the adjacency matrix is set to 1 if also belongs to the same group as and 0 otherwise;

• “Random”: The off-diagonal th element of the adjacency matrix is randomly set to be 1 with certain probability and 0 otherwise. Figure 1: An illustration of the three graph patterns with p=100 nodes.

We compare the proposed gES estimator with the graphical lasso (glasso) [6, 25] and constrained minimization estimator (CLIME)  with the tuning parameters determined by 10-fold cross-validation. We also consider the method of multiple testing of hypotheses about vanishing partial correlation coefficients [4, 5], where the family-wise error rates for incorrect edge inclusion are controlled by Bonferroni correction at level . We refer this method as “pcorTest” in this paper. For the gES estimator, the full dataset is random partitioned into two subsamples with equal size, while for other methods, the graphs are estimated using the full dataset in order to make the comparison fair.

Notice that when is large, the Metropolis-Hastings algorithm for the gES estimator might take a large number of iterations to generate a good random tour over the hypercube with as many as dimensions, and thus could become computationally unsuitable. In this case, we need to approximate the model space and select a candidate set of edges before the execution of the Metropolis-Hastings algorithm. We identify a candidate set of edges by applying some pre-screening method to the first subsample . Let the ordered list be the set of selected edges. Then our aggregation process is constructed on this subset of edges. Then the set of sparsity patterns is given by

 M=p(p−1)/2∏k=1Ck, (29)

where

 Ck:={{0,1}if (ik,jk)∈Qp, where (ik,jk) is the kth % element of Sp{0}otherwise. (30)

In practice where is large, the Metropolis-Hastings algorithm introduced in Algorithm 2.1 will be applied to this reduced set of sparsity patterns instead of the set of all possible edges . Many pre-screening methods could be considered here, for instance, the glasso. Note that we prefer to choose a small regularization parameter to prevent any true edge being ruled out in this stage. An alternative algorithm could be based on the regularization paths from the glasso method. Let be a set of regularization parameters. For each , let be the sparsity pattern resulted from glasso with tuning parameter . Then the aggregation method is constructed on the set of sparsity patterns . However, in practical applications, the Metropolis-Hastings algorithm based on the regularization paths always concentrations on a single model after convergence, and generally does not perform well.

For any estimator , we use the following criteria for the comparisons.

• Squared Frobenius norm error:

 Frobenius=∥ˆΘ−Θ∥2F; (31)
• Kullback-Leibler loss:

 KL =−logdet(ˆΘ)+trace(ˆΘΣ)−(−logdet(Θ)+p); (32)
• Precision: the number of correctly estimated edges divided by the total number of edges in the estimated graph. Let be a -dimensional graph and let be an estimated graph. We define

 Precision=|ˆE∩E||ˆE|; (33)
• Recall: the number of correctly estimated edges divided by the total number of edges in the true graph, defined by

 Recall=|ˆE∩E||E|; (34)
• -score: a weighted average of the precision and recall, defined by

 F1-score=2⋅Precision⋅Recall% Precision+Recall, (35)

where an -score reaches its best value at 1 and worst score at 0.

Table 1 shows the simulation results of the quantitative comparison of different methods, where we repeat the experiments 50 times and report the averaged values with their standard errors. We can see that the gES and CLIME estimators perform better than glasso and pcorTest in term of the squared Frobenius norm errors, while gES and glasso are better than CLIME and pcorTest regarding the Kullback-Leibler loss. For the comparison of graph structure recovery, the gES and pcorTest estimators outperform other methods.

Figure 2 displays the evolution of the Metropolis-Hastings algorithm, showing evidence that the algorithm converges after 4000 iterations. Figure 3 shows a typical realization of the gES method, varying the regularization parameter in the pre-screening glasso, where we can see that the results for graphical aggregation are not quite sensitive to the pre-screening method. Figure 2: Number of selected edges by the Metropolis-Hastings algorithm as a function of iterations, given a typical run of the gES estimator for Hub graph on simulated data with (n,p)=(400,100). Figure 3: Typical realization of the gES estimator for Hub graph on simulated data with (n,p)=(400,100), varying the regularization parameter λ in the pre-screening glasso; Number of selected candidate edges (marked as E) are also reported.

The computational complexity of glasso  which uses a row-by-row block coordinate algorithm is roughly . For the gES estimator, the complexity is roughly , where is the number of iterations in the Metropolis-Hastings algorithm. Note that can be dramatically reduced if we keep track of and store the individual estimators to avoid duplicated computation of precision matrix with the same sparsity pattern.

### 4.2 Analysis of microarray data

In this study, we consider a real-world dataset based on Affymetrix GeneChip microarrays for the plant Arabidopsis thaliana . The sample size is . A nonparanormal transformation is estimated and the expression levels for each chip are replaced by their respective normal scores, subject to a Winsorized truncation . A subset of genes from the isoprenoid pathway are chosen, and we study the associations among them using the proposed gES estimator and the glasso method with tuning parameter determined by cross-validation.

The results show that glasso selects 378 edges and gES selects 111 edges. Among those selected edges, 102 edges are identified by both methods. Figure 4 shows grids of rectangles with gray scale corresponding to the absolute values in the estimated precision matrix for each method. Figure 4: Grids of rectangles with gray scale corresponding to the absolute values in the estimated precision matrix for each method.

We also analyze a dataset on microarrays for the gene expression levels . Of the 4238 genes in immortalized B cells for normal individuals, we select genes that are associated with the phenotypes in genome-wide association studies. We study the estimated graphs obtained by the glasso and gES estimators. The expression levels for each gene are pre-processed by log-transformation and standardization.

The results indicate that glasso selects 12514 edges and the gES estimator selects 1631 edges. Among those, 1542 edges are identified by both methods. Figure 5 provides the estimated graphs. Figure 5: Estimated graphs for microarray data example for immortalized human B cells.

## 5 Conclusion

In this paper, we propose a new aggregation method for estimating the precision matrix in Gaussian graphical models, by considering a convex combination of a suitable set of individual estimators with different underlying sparsity patterns. We investigate the risk of this aggregation estimator and show by an oracle that it is comparable to the risk of the best estimator based on a single graph. Experimental results validate the usefulness of our method in practice.

## Acknowledgements

This paper is based on part of my Ph.D dissertation. I want to thank my advisor, Dr. John Lafferty, for his help and guidance on this project.

## Appendix A Appendix

### a.1 Proof of Proposition 3.1.

###### Proof.

For any and individual estimator , we obtain

 KL(ˆΘ(1)m)=tr(ˆΘ(1)mΣ)−logdet(ˆΘ(1)m)−p+logdet(Θ). (36)

Similarly, for the aggregation estimator

 KL(ˆΘ(1,2)gES)=tr(ˆΘ(1,2)gESΣ)−% logdet(ˆΘ(1,2)gES)−p+logdet(Θ). (37)

Based on the convexity of , we obtain

 KL(ˆΘ(1,2)gES)≤∑m∈Mv(1,2)mKL(ˆΘ(1)m). (38)

Let being any sparsity pattern attaining

 minm∈M{KL(ˆΘ(1)m)+2n2log1πm+tr% (ˆΘ(1)m(ˆS(2)n2−Σ))}. (39)

Then according to the definition of , we obtain

 v(1,2)~mv(1,2)m= exp{(n2/2)(logdet(ˆΘ(1)~m)−tr(ˆΘ(1)~mˆS(2)n2))}π~mexp{(n2/2)(logdet(ˆΘ(1)m)−tr(ˆΘ(1)mˆS(2)n2))}πm (40) = π~mπmexp{−(n2/2)[KL(ˆΘ(1)~m)−KL(ˆΘ(1)m)+tr((ˆΘ(1)~m−ˆΘ(1)m)(ˆS(2)n2−Σ))]}, (41)

where the last equality follows from the fact that

 KL(ˆΘ(1)m)=tr(ˆΘ(1)m(Σ−ˆS(2)n2))−(logdet(ˆΘ(1)m)−tr(ˆΘ(1)mˆS(2)n2))−p+logdet(Θ). (42)

Note that . Then the inequality (38) can be written as

 KL(ˆΘ(1,2)gES) ≤KL(ˆΘ(1)~m)+∑m∈Mv(1,2)m(KL(ˆΘ(1)m)−KL(ˆΘ(1)~m)) (43) ≤KL(ˆΘ(1)~m)+2n2∑m∈Mv(1,2)mlogv(1,2)~mv(1,2)m+2n2∑m∈Mv(1,2)mlogπmπ~m +∑m∈Mv(1,2)mtr(ˆΘ(1)~m(ˆS(2)n2−Σ))−∑m∈Mv(1,2)mtr(ˆΘ(1)m(ˆS(2)n2−Σ)) (44) ≤KL(ˆΘ(1)~m)+2n2∑m∈Mv(1,2)mlogπmv(1,2)m+2n2∑m∈Mv(1,2)mlogv(1,2)~mπ~m +tr(ˆΘ(1)~m(ˆS(2)n2−Σ))−∑m∈Mv(1,2)mtr(ˆΘ(1)m(ˆS(2)n2−Σ)). (45)

According to the fact that

 logv(1,2)~m≤0, and ∑m∈Mv(1,2)mlogπmv(1,2)m≤0, (46)

we obtain

 KL(ˆΘ(1,2)gES)≤ KL(ˆΘ(1)~m)+2n2log1π~m+tr(ˆΘ(1)~m(ˆS(2)n2−Σ))−tr(ˆΘ(1,2)gES(ˆS(2)n2−Σ)). (47)

It is shown in Wang et al.  that for any real symmetric matrix and any positive semidefinite matrix

 λp(A)tr(B)≤% tr(AB)≤λ1(A)tr(B), (48)

where is the

th largest eigenvalue of

.

Following this property, we obtain

 ∣∣tr(ˆΘ(1,2)gES(ˆS(2)n2−Σ))∣∣≤∥ˆS(2)n2−Σ∥⋅tr(ˆΘ(1,2)gES). (49)

Then we write the inequality (47) as

 KL(ˆΘ(1,2)gES)≤ KL(ˆΘ(1)~m)+2n2log1π~m+tr(ˆΘ(1)~m(ˆS(2)n2−Σ))+∥ˆS(2)n2−Σ∥⋅tr(ˆΘ(1,2)gES). (50)

According to the definition of as in (39), the proposition then follows. ∎

### a.2 Proof of Theorem 3.1.

###### Proof.

Since is any sparsity pattern attaining then

 KL(ˆΘ(1,2)gES)≤ KL(ˆΘ(1)m∗)+2n2log1πm∗+tr(ˆΘ(1)m∗(ˆS(2)n2−Σ))+∥ˆS(2)n2−Σ∥⋅tr(ˆΘ(1,2)gES) (51) = minm∈MKL(ˆΘ(1)m)+2n2log1πm∗+tr(ˆΘ(1)m∗(ˆS(2)n2−Σ))+∥ˆS(2)n2−Σ∥⋅tr(ˆΘ(1,2)gES). (52)

The normalization factor of the prior probability in Definition 2.2 satisfies the following inequality

 H =∑m∈M(|m|1ep(p−1))|m|1 (53) ≤p(p−1)/2∑k=0(p(p−1)/2k)(kep(p−1))k (54) ≤p(p−1)/2∑k=0(ep(p−1)/2k)k(kep(p−1))k (55) ≤2. (56)

Then for any we have

 log(1πm) ≤|m|1log(ep(p−1)|m|1∨1)+log2. (57)

Thus we obtain

 2n2log1πm∗=OP(s∗logpn2), (58)

where is the number of nonzero off-diagonal elements of .

Note that

 tr(ˆΘ(1)m