 # Maximum Pairwise Bayes Factors for Covariance Structure Testing

Hypothesis testing of structure in covariance matrices is of significant importance, but faces great challenges in high-dimensional settings. Although consistent frequentist one-sample covariance tests have been proposed, there is a lack of simple, computationally scalable, and theoretically sound Bayesian testing methods for large covariance matrices. Motivated by this gap and by the need for tests that are powerful against sparse alternatives, we propose a novel testing framework based on the maximum pairwise Bayes factor. Our initial focus is on one-sample covariance testing; the proposed test can optimally distinguish null and alternative hypotheses in a frequentist asymptotic sense. We then propose diagonal tests and a scalable covariance graph selection procedure that are shown to be consistent. A simulation study evaluates the proposed approach relative to competitors. The performance of our graph selection method is demonstrated through applications to a sonar data set.

Comments

There are no comments yet.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider a sample of observations from a high-dimensional normal model in which the number of variables can grow with the sample size ,

 X1,…,Xn∣Σn i.i.d.∼ Np(0,Σn), (1)

where is a covariance matrix. There is often interest in inferring the structure in and in comparing different alternative covariance structures. This article focuses on this problem from a hypothesis testing perspective. Let be the data matrix. A one-sample covariance test can be reduced to the following simple form:

 H0:Σn=IpversusH1:Σn≠Ip,

by noting that

is equivalent to a null hypothesis

for any given positive definite matrix

by applying the linear transformation

.

Another important problem is testing diagonality

 H0:σij=0 for any i≠j versus H1: not H0,

where . Finally, we consider the problem of support recovery

, corresponding to estimating the nonzero elements of covariance matrices.

We are interested in constructing novel Bayesian tests that are practically applicable with theoretical guarantees for the (i) one-sample covariance test, (ii) diagonality test, and (iii) support recovery of the covariance matrix. Throughout the paper, we consider the high-dimensional setting in which the number of variables can grow to infinity as the sample size gets larger and possibly be much larger than . Although it is well known that assuming a restricted covariance class is necessary for consistent estimation of large covariance matrices (Johnstone and Lu, 2009; Lee and Lee, 2018), in a testing context we focus on alternative hypotheses that are unconstrained. One natural possibility is to assume a conjugate inverse-Wishart prior for under

. However, in order for the resulting posterior to be proper, it is necessary to choose the degrees of freedom

, suggesting an extremely informative prior in high-dimensional settings. The resulting test will certainly be highly sensitive to the choice of

, and hence is not very useful outside of narrow applications having substantial prior information. One could instead choose a non-conjugate prior for

under , but then substantial computational issues arise in attempting to estimate the Bayes factor.

From a frequentist perspective, Chen et al. (2010) and Cai and Ma (2013)

suggested consistent one-sample covariance tests based on unbiased estimators of

, where is the Frobenius norm of a matrix

. Under the null hypothesis, they showed that their test statistic is asymptotically normal. The test also has power tending to one as

goes to infinity, but it requires the condition, as . This condition implies that they essentially adopted for some as as the alternative class. Cai and Ma (2013) proved that if we consider an alternative class , the condition is inevitable for any level test to have power tending to one. This excludes cases in which a finite number of the components of have a magnitude , although can be a significant signal when .

The above discussion motivates us to develop hypothesis tests that are easy to implement in practice while possessing theory guarantees. In particular, we wish to construct tests that can perform well (e.g., a consistent test) even when the condition fails to hold. We achieve this by proposing a novel Bayesian testing framework based on the maximum pairwise Bayes factor (mxPBF) which will be introduced in Section 2.2. The basic strategy is to focus on the pairwise difference between and rather than the Frobenius norm or other matrix norms. More precisely, instead of considering a usual Bayes factor based on a prior on the whole covariance matrix, we first consider the pairwise Bayes factors for each element of the matrix and combine them by taking a maximum over all possible pairs. This approach enables us to consider a different alternative class, for some constant , where for a matrix . When the primary interest is not on a collection of very weak signals, but on detecting at least one meaningful signal, our test based on mxPBF is much more effective than the frequentist methods mentioned above.

The main contributions of this paper are as follows. The mxPBF is a general theoreticall-supported method with a simple implementation that can be readily used by practitioners. mxPBF is also the first Bayesian test that has been shown to be consistent in the high-dimensional setting for the one-sample or diagonal covariance testing problems. In the one-sample case, the proposed test is rate-optimal in the sense that it can distinguish the elementwise maximum norm-based alternative class from the null with the fastest rate of , while guaranteeing consistency under the null. We also propose a scalable graph selection method for high-dimensional covariance graph models using pairwise Bayes factors. The proposed method consistently recovers the true covariance graph structure under a weaker or comparable condition to those in the existing frequentist literature.

The rest of the paper is organized as follows. Section 2 introduces the notations and definition of the mxPBF. Section 3 presents the main results of the paper: the one-sample covariance test, diagonality test and support recovery of the covariance matrix. In Section 4 the practical performance of the proposed methods is evaluated based on simulation study and real data analysis. R codes for implements of our empirical results are available at https://github.com/leekjstat/mxPBF. Concluding remarks are given in Section 5. Proofs of our main results are included in an Appendix, with additional results in Supplemental Materials.

## 2 Preliminaries

### 2.1 Notations

For any real values and , we denote as the maximum between and . For any positive sequences and , we denote or if as

. For any vector

, we define the vector - and -norm as and , respectively. Let be the set of all positive definite matrices. We denote

as the non-central chi-square distribution with degrees of freedom

and non-centrality , and let . For positive real values and ,

denotes the inverse gamma distribution with shape

and scale .

### 2.2 Maximum Pairwise Bayes Factor (mxPBF)

In this subsection, we introduce the mxPBF focusing on the one-sample covariance test. As described before, the basic strategy is to concentrate on the pairwise difference between and . Let be the th column vector of . For any indices and

, based on the joint distribution (

1), the conditional distribution of given is

 ~Xi∣~Xj ∼ Nn(aij~Xj,τ2ijIn), (2)

where and . We can view model (2

) as a linear regression model given a design matrix

. For each paired conditional model (2), we consider a testing problem

 H0,ij:aij=0,τ2ij=1versusH1,ij: not H0,ij. (3)

If is true, and because and , where and are covariance and correlation matrices, respectively. We suggest the following prior distribution under the alternative hypothesis in (3),

 aij∣τ2ij∼N(0,τ2ijγ∥~Xj∥−12),τ2ij∼IG(a0,b0), (4)

where and and are positive constants. The induced Bayes factor is

 B10(~Xi,~Xj) := p(~Xi∣~Xj,H1,ij)p(~Xi∣~Xj,H0,ij) = ba00Γ(a0)(γ1+γ)1/2Γ(n2+a0)enˆτ2i/2(n2ˆτ2ij,γ+b0)−n/2−a0,

where , and .

The null hypothesis in the one-sample covariance test, , is true if is true for all pairs such that . We aggregate the information from each pairwise Bayes factor via the maximum pairwise Bayes factor (mxPBF),

 Bmax,10(Xn) := maxi≠jB10(~Xi,~Xj). (5)

A large value for provides evidence supporting the alternative hypothesis. By taking a maximum, the mxPBF supports the alternative hypothesis if at least one of the pairwise Bayes factors support the alternative. A natural question is whether false positives increase as we take a maximum over more and more pairs. Indeed, we find that this is not the case, either asymptotically based on our consistency results (Theorems 3.1 and 3.3) or in finite samples based on simulations.

## 3 Main Results

### 3.1 One-sample Covariance Test

In this subsection, we show consistency of the mxPBF defined in (5) for the one-sample covariance test

 H0:Σn=IpversusH1:Σn≠Ip. (6)

We first introduce assumptions for consistency under . Let be the true covariance matrix, implying the conditional distribution of given is

 ~Xi∣~Xj ∼ Nn(a0,ij~Xj,τ20,ijIn) (7)

under , where , and

is the probability measure corresponding to model (

1) with . Note that if and only if . Under the true alternative , we assume that satisfies at least one of the following conditions:

• There exists an index satisfying

 ∣∣σ0,ii−1∣∣ ≥ (4σ0,ii√C1+C2+2b0√nlog(n∨p))√log(n∨p)n (8)

for some constants and .

• There exists a pair satisfying

 ∣∣τ20,ij−1∣∣ ≥ (4τ20,ij√C1+C2+2b0+τ20,ij√nlog(n∨p))√log(n∨p)n. (9)
• There exists a pair satisfying

 σ20,ij ≥ σ0,jj1−2√C1ϵ0{9C1τ20,ij(1−C3)2∨C4(α+2)C3}log(n∨p)n (10)

for some constants and .

Throughout the paper, and are fixed global constants. For a given small constant , they can be considered as and .

Condition (A1) is required to detect a non-unit variance

. The condition (8) can be interpreted as a beta-min condition for . The beta-min condition gives a lower bound for nonzero parameters and is essential for model selection consistency (Castillo et al., 2015; Martin et al., 2017). Interestingly, the rate of lower bound in (A1) is given by , which has been commonly used in the variable selection literature. Condition (A2) is similar to condition (A1), which can be interpreted as a beta-min condition for . Condition (A3) is also the beta-min condition for off-diagonal elements of the covariance matrix. The right hand side of (10) implies that the larger and are, the larger value of is required to consistently detect the dependency between and . Based on (7), the former means that larger value of residual variance makes the inference harder. The conditional expectation in (7) can be considered as , where . Thus, a large value of makes it hard to detect nonzero . In summary, conditions (A1)–(A3) imply that

 Σ0 ∈ H1:={Σn:∥Σn−Ip∥2max≥C⋅logpn}

for some constant , which corresponds to the meaningful difference we mentioned earlier. In fact, the rate is optimal to guarantee the consistency under both hypotheses (Theorem 3.2).

Theorem 3.1 shows that the mxPBF is consistent for the one-sample covariance test even in the high-dimensional setting as long as for some small constant .

###### Theorem 3.1

Consider model (1) and the one-sample covariance testing problem versus . Consider prior (4) under in (3) with and for some small constant . Assume that for all large , and satisfies at least one of conditions (A1)–(A3) or . Then, the mxPBF defined in (5) is consistent under . Moreover, under , we have

 −logBmax,10(Xn) = Op(log(n∨p)),

and under ,

 logBmax,10(Xn) = Op(log(n∨p)).

We first prove that the pairwise Bayes factor is consistent on a large event such that as . To show the consistency of the mxPBF under , we need to prove that as , which means that the false discovery rate converges to zero. The condition for in Theorem 3.1 is closely related to this requirement. To show the consistency of the mxPBF under , it suffices to show as for some index satisfying at least one of conditions (A1)–(A3). Interestingly, the rate of convergence is similar under both hypotheses, unlike most Bayesian testing procedures with the notable exception of non-local prior based methods (Johnson and Rossell, 2010, 2012).

• Compared to existing frequentist one-sample covariance tests in Chen et al. (2010) and Cai and Ma (2013), our proposed test can detect if there is at least one meaningful nonzero entry in , i.e. for some constant . For example, suppose and for some constant , which implies . Then, the mxPBF consistently detects the true . However, the powers of the frequentist tests do not tend to 1 in this case, because the condition is not met.

The next theorem shows the optimality of the alternative class which is considered in Theorem 3.1 (Conditions (A1)–(A3)). It says, when the alternative class is defined based on the element-wise maximum norm, the condition for some constant is necessary for any consistent test to have power tending to one. Thus, conditions (A1)–(A3) are rate-optimal to guarantee the consistency under as well as .

###### Theorem 3.2

Let be the expectation corresponding to model (1). For a given constant , define

 H1(C⋆) = {Σ∈Cp:∥Σ−Ip∥2max≥C2⋆logpn}.

If , then for any consistent test such that as ,

 limsupn→∞infΣ∈H1(C⋆)EΣϕ≤12.

### 3.2 Testing Diagonality

We now extend mxPBF to test diagonality of the covariance matrix:

 H0:σij=0 for any i≠j versus H1: not H0, (11)

where . The above hypothesis testing problem can be modularized into many pairwise independence tests

 H0,ij:σij=0 versus H1,ij:σij≠0 (12)

for all . We can adopt the mxPBF idea to aggregate the pairwise testing information from (12) for all possible pairs such that to test (11). Based on the conditional distribution (2), the null hypothesis in (12) is equivalent to . We suggest the prior under and prior (4) under , which leads to the pairwise Bayes factor

 ~B10(~Xi,~Xj) = (γ1+γ)1/2(nˆτ2ij,γ+2b0nˆτ2i+2b0)−n/2−a0.

We suggest using the mxPBF,

 ~Bmax,10(Xn) := maxi≠j~B10(~Xi,~Xj), (13)

for the hypothesis testing problem (11). Theorem 3.3 states the consistency of the mxPBF for testing (11) under regularity conditions. For consistency under the alternative hypothesis, we assume the following condition:
(A4) There exists a pair satisfying

 σ20,ij ≥ C4σ0,jj1−2ϵ0√C1{9C1τ20,ij(1−C3)2∨α(1+γ){(1+4ϵ0√C1)σ0,ii+2b0n−1}C3}log(n∨p)n

for constants and defined in Section 3.1.

###### Theorem 3.3

Consider model (1) and the diagonality testing problem (11). For a given pair such that , consider the prior under and prior (4) under in (12) with for some small constant . Assume that for all large , and satisfies condition (A4) or . Then the mxPBF defined in (13) is consistent under . Moreover, under ,

 −log~Bmax,10(~Xi,~Xj) = Op(log(n∨p)),

and under ,

 log~Bmax,10(~Xi,~Xj) = Op(log(n∨p)).

Condition (A4) is the beta-min condition for off-diagonal elements of the true covariance matrix. It indicates that if one of the off-diagonal elements satisfies the beta-min condition (A4), the mxPBF consistently detects the true alternative hypothesis. Similar to Theorem 3.1, the condition for is required to control the false discovery rate, and the mxPBF has similar rates of convergence under both hypotheses.

As a by-product, when pairwise independence testing (12) itself is of interest, we suggest a pairwise Bayes factor

 ~Bpair,10(~Xi,~Xj) := ~B10(~Xi,~Xj)∨~B10(~Xj,~Xi), (14)

which can be shown to be consistent. Note that decisions from tests based on and can differ in finite samples, which is undesirable. To obtain an order-invariant test, we suggest using the pairwise Bayes factor (14). For consistency under the alternative hypothesis, we assume

 σ20,ij ≥ C4σ0,jj1−2ϵ0√C1{9C1τ20,ij(1−C3)2∨α(1+γ){(1+4ϵ0√C1)σ0,ii+2b0n−1}C3}lognn (15)

for constants and defined in Section 3.1. If we substitute in the above condition with , it coincides with condition (A4). The proof of Corollary 3.4 is obvious from that of Theorem 3.3, thus it is omitted.

###### Corollary 3.4

Consider model (1) and a hypothesis testing problem (12) for a given pair such that . Suppose we use prior under and prior (4) under with for some positive constants and . Assume that either or at least one of and satisfy (15). Then, the pairwise Bayes factor is consistent under . Moreover, under , we have

 −log~Bpair,10(~Xi,~Xj) = Op(logn),

and under ,

 log~Bpair,10(~Xi,~Xj) = Op(logn).

### 3.3 Support Recovery of Covariance Matrices

The primary interest of this section is on the recovery of , where is the nonzero index set of the true covariance matrix . We call the support of . Estimating corresponds to graph selection in covariance graph models (Cox and Wermuth, 1993). Despite its importance, few Bayesian articles have investigated this problem. Kundu et al. (2013) proposed the regularized inverse Wishart (RIW) prior, which can be viewed as a group Lasso penalty (Yuan and Lin, 2006) on the Cholesky factor. They showed the consistency of their selection procedure for the support of precision matrices when the dimension is fixed. Recently, Gan et al. (2018) adopted the spike-and-slab Lasso prior (Ročková and George, 2016; Ročková et al., 2018) for off-diagonal entries of the precision matrix. Their proposed graph selection procedure for the precision matrix also yields selection consistency. To the best of our knowledge, in the Bayesian literature, a consistent support recovery result for covariance matrices has not been established.

To tackle this gap, we propose a scalable graph selection scheme for high-dimensional covariance matrices based on pairwise Bayes factors. By looking closer at the proof of Theorem 3.3, it reveals that each pairwise Bayes factor can consistently determine whether the corresponding covariance element is zero or not. Thus, we suggest using the estimated index set

 ˆSpair = {(i,j):2log~Bpair,10(~Xi,~Xj)>Csel,1≤i

for some constant . Asymptotically, any constant can be used for consistent selection. However, in practice, we suggest using the threshold , which corresponds to ‘very strong evidence’ under the Bayes factor thresholds of Kass and Raftery (1995). In the frequentist literature, Drton and Perlman (2004) and Drton and Perlman (2007) proposed selection procedures using a related idea to (16), which select a graph by multiple hypothesis testing on each edge. However, they considered only the low-dimensional setting, .

For the consistency of , we introduce the following condition for some constants and :
(A.ij) For a given pair such that ,

 σ20,ij ≥ C4σ0,jj1−2ϵ0√C5{9C5τ20,ij(1−C3)2∨α(1+γ){(1+4ϵ0√C5)σ0,ii+2b0n−1}C3}log(n∨p)n.

The beta-min condition (A.ij) is almost the same with (A4) except using instead of to control the probabilities of small events on which the pairwise Bayes factor might not be consistent. Theorem 3.5 states that (16) achieves model selection consistency if at least conditions (A.ij) or (A.ji) hold for all .

###### Theorem 3.5

Consider model (1) and prior (4) with for some small constant and each pair such that . Assume that for all large and at least one of conditions (A.ij) and (A.ji) hold for all . Then, we have

 limn→∞P0(ˆSpair=S(Σ0)) = 1

We note that consistently recovers the support of the true covariance matrix regardless of the true sparsity as long as and nonzero entries satisfy the beta-min condition (A.ij). Rothman et al. (2009) proved a similar support recovery result for generalized thresholding of the sample covariance matrix while assuming , for some and for some sufficiently large . Cai and Liu (2011) assumed and for some and obtained the consistent support recovery result of the covariance matrix using adaptive thresholding. In terms of the condition on and , our condition, , is much weaker than the conditions used in the literature. The beta-min condition (A.ij) is similar to that in Cai and Liu (2011) and also has the same rate to that in Rothman et al. (2009) if we assume for some . Thus, the required condition in Theorem 3.5 is weaker or comparable to the conditions used in the literature.

## 4 Numerical Results

### 4.1 Simulation Study: One-sample Covariance Test

In this section, we demonstrate the performance of our one-sample covariance test in various simulation cases. The four hyperparameters were chosen as

and , which satisfy the sufficient conditions in Theorem 3.1. If we assume a small , the above choice of asymptotically satisfies . We compare our one-sample covariance test with existing frequentist tests, proposed by Cai and Ma (2013) and Srivastava et al. (2014). The test suggested by Srivastava et al. (2014) is based on estimating the squared Frobenius norm, and has a similar perspective to the test proposed by Cai and Ma (2013). The test statistic is asymptotically normal if for some , but does not have any theoretical guarantee on power.

We generated 100 data sets from each hypothesis and , for various choices of and . We considered two structures for the alternative hypothesis . First, we chose a tridiagonal matrix with

 σ0,ij = I(i=j)+ρ⋅I(|i−j|=1) (17)

for some signal strength constant ranging from to by . For each value of

, we generated 100 data sets from the multivariate normal distribution

. As a second case for , we let

 σ0,ij = I(i=j)+ρ⋅I(i=1,j=2)+ρ⋅I(i=2,j=1), (18)

for some constant . Because (18) has signals at only two locations, the difference between and is sparse. mxPBF and the frequentist tests require and , respectively for consistency, while we have for (18). Hence, mxPBF is potentially much more powerful than the tests in Cai and Ma (2013) and Srivastava et al. (2014) in this setting. To check this conjecture, we increased from to by and generated 100 simulated data from .

We calculated receiver operating characteristic (ROC) curves to illustrate and compare the performance of the tests. For each setting, points of the ROC curves were obtained based on various thresholds and significance levels for mxPBF and frequentist tests, respectively. Figure 1: ROC curves are represented for the three tests based on 100 simulated data sets for each hypothesis H0 and H1, where (17) was used for H1. mxPBF, CM and SYK represent the test proposed in this paper, Cai and Ma (2013) and Srivastava et al. (2014), respectively. Figure 2: ROC curves are represented for the three tests based on 100 simulated data sets for each hypothesis H0 and H1, where (18) was used for H1. mxPBF, CM and SYK represent the test proposed in this paper, Cai and Ma (2013) and Srivastava et al. (2014), respectively.

We tried and for each setting. Figure 1 shows ROC curves based on 100 simulated data from and 100 simulated data from with a tridiagonal given in (17), when and . In this setting, the tests in Cai and Ma (2013) and Srivastava et al. (2014) seem to work better than the test proposed in this paper. This is as expected since these earlier tests are based on the squared Frobenius norm , while mxPBF focuses on the maximum difference . Thus, when the difference between the true covariance and the point null is dense, the test based on the Frobenius norm difference would be more powerful. Figure 2 shows ROC curves based on 100 simulated data from and 100 simulated data from with a tridiagonal given in (18), when and . As expected, the mxPBF is much more powerful than the frequentist tests when is sparse. The mxPBF test works reasonably well when the signal is larger than , but the frequentist tests do not work well even when is . Furthermore, unlike the previous result described in Figure 1, the performance of the frequentist tests is not getting better as and increase, while mxPBF has better performance when than . It confirms the theoretical property that the tests in Cai and Ma (2013) and Srivastava et al. (2014) require to have power tending to one. Note that remains the same while increases as the setting is changed from to .

### 4.2 Simulation Study: Testing Diagonality

We conducted a simulation study to illustrate the performance of the diagonality test based on the mxPBF. The hyperparameters in the mxPBF were chosen as in section 4.1 except . For each hypothesis for any and not , 100 data sets were generated. The two structures of under used in the previous section, (17) and (18), were considered.

We compare our test with frequentist tests suggested by Cai and Jiang (2011) and Lan et al. (2015). Cai and Jiang (2011) proposed a diagonality test and showed its asymptotic null distribution in the high-dimensional setting. Their test is based on the maximum of sample correlations between variables. Note that in the pairwise Bayes factor is a decreasing function of the sample correlation between and . Thus, their test has a similar aspect to our test. Lan et al. (2015) developed a test in the regression setting and showed asymptotic normality. Their test statistic is based on the squared Frobenius norm of a sample covariance matrix, so it is expected that it will have a high power when off-diagonal entries of the true covariance matrix are dense. Figure 3: ROC curves are represented for the three tests based on 100 simulated data sets for each hypothesis H0 and H1. Type 1 and Type 2 mean that the true covariance matrix Σ0 under H1 were generated from (17) and (18), respectively. mxPBF, Lan and CJ represent the test proposed in this paper, Lan et al. (2015) and Cai and Jiang (2011), respectively.

The results for the case are reported in Figure 3. ROC curves of tests based on the mxPBF and proposed by Cai and Jiang (2011) are almost identical in every setting. As expected, the test suggested by Lan et al. (2015) is more powerful when is dense, which corresponds to “Type 1” in Figure 3. The tests proposed in this paper and Cai and Jiang (2011) have less power, but also work reasonably well when the signal is larger than . On the other hand, they tend to have a high power when is sparse, which corresponds to “Type 2” in Figure 3. The ROC curve of the test of Lan et al. (2015) is close to the straight line, so it does not work well in this setting; similar behavior occurs even when is .

### 4.3 Support Recovery using Sonar Data

To describe the practical performance of the support recovery procedure (16), , we analyzed the sonar data set used by Gorman and Sejnowski (1988). The data record sonar signals collected from a rock and a metal cylinder, which were both approximately 5 feet long and embedded in the seabed. The impinging pulse sent to the targets was a wide-band linear frequency-modulated (FM) chirp (ka = 55.6). Data were collected within 10 meters of each object at various angles, spanning degrees for the rock and degrees for the metal cylinder. The rock data contains returned patterns, while the metal data contains returns. Each pattern consists of values supported in the range