Estimating a large covariance or precision matrix is a challenging task in both frequentist and Bayesian frameworks. When the number of variables $p$ is larger than the sample size $n$, the traditional sample covariance matrix does not provide a consistent estimate of the true covariance matrix (Johnstone and Lu, 2009), and the inverse Wishart prior leads to posterior inconsistency (Lee and Lee, 2018). To overcome this issue, various restricted classes of matrices have been investigated, such as bandable matrices (Bickel and Levina, 2008; Cai et al., 2010; Hu and Negahban, 2017; Banerjee and Ghosal, 2014; Lee and Lee, 2017), sparse matrices (Cai and Zhou, 2012a; Banerjee and Ghosal, 2015; Xiang et al., 2015; Cao et al., 2017) and low-dimensional structural matrices (Fan et al., 2008; Cai et al., 2015; Pati et al., 2014; Gao and Zhou, 2015). In this paper, we focus on banded precision matrices, where the banded structure is encoded via the Cholesky factor of the precision matrix. We are particularly interested in the estimation of the bandwidth parameter and the construction of Bayesian bandwidth tests for one or two banded precision matrices. Inference of the bandwidth is of great importance for detecting the dependence structure of ordered data. Moreover, it is a crucial initial step for subsequent analyses such as linear or quadratic discriminant analysis.
Bandwidth selection for high-dimensional precision matrices has received increasing attention in recent years. An et al. (2014) proposed a test for bandwidth selection whose test statistic is asymptotically normal under the null hypothesis and whose power tends to one. Based on the proposed test statistic, they constructed a backward procedure to detect the true bandwidth by controlling the familywise error. Cheng et al. (2017) suggested a bandwidth test without assuming any specific parametric distribution for the data and obtained a result similar to that of An et al. (2014).
In the Bayesian literature, Banerjee and Ghosal (2014) studied the estimation of bandable precision matrices, which include the banded precision matrix as a special case. They derived the posterior convergence rate of the precision matrix under the G-Wishart prior (Roverato, 2000). Lee and Lee (2017) considered a similar class to that of Banerjee and Ghosal (2014), but assumed bandable Cholesky factors instead of bandable precision matrices. They derived the posterior convergence rates of the precision matrix as well as the minimax lower bounds. In both works, posterior convergence rates were obtained for a given (fixed) bandwidth, and the posterior mode was suggested as a bandwidth estimator in practice. However, no theoretical guarantee is provided for such estimators. Further, no Bayesian bandwidth test exists for one- or two-sample problems.
This gap in the literature motivates us to investigate theoretical properties related to the general problem of bandwidth testing and selection, and to propose estimators and tests with theoretical guarantees. In this paper, we use the modified Cholesky decomposition of the precision matrix and assume banded Cholesky factors. The induced precision matrix also has a banded structure. The key difference from Lee and Lee (2017) lies in the choice of prior distributions, which will be introduced in Section 2.3. In addition, we focus on bandwidth selection and tests, while Lee and Lee (2017) mainly studied the convergence rates of the precision matrix for a given or fixed bandwidth.
There are two main contributions of this paper. First, we suggest a Bayesian procedure for banded precision matrices and prove the bandwidth selection consistency (Theorem 3.1) and consistency of the Bayes factor (Theorem 3.2). To the best of our knowledge, our work is the first to establish bandwidth selection consistency for precision matrices under a Bayesian framework, which implies that the marginal posterior probability for the true bandwidth tends to one as $n \to \infty$. Cao et al. (2017) proved strong model selection consistency for sparse directed acyclic graph models, but their method is not applicable to the bandwidth selection problem since it is not adaptive to the unknown sparsity. Second, we prove the consistency of the Bayes factor for the two-sample bandwidth testing problem (Theorem 3.3) and derive the convergence rates of the Bayes factor under both the null and alternative hypotheses. Our method is able to consistently detect the equality of bandwidths between two different precision matrices. To the best of our knowledge, this is also the first consistent two-sample bandwidth test result in either the frequentist or the Bayesian literature. The existing frequentist literature has focused only on one-sample bandwidth testing (An et al., 2014; Cheng et al., 2017).
The rest of the paper is organized as follows. Section 2 introduces the notation, model, priors and assumptions used. Section 3 describes the main results of this paper: bandwidth selection consistency and convergence rates of one- and two-sample bandwidth tests. A simulation study and a real data analysis are presented in Section 4 to show the practical performance of the proposed method. In Section 5, concluding remarks and topics for future work are given. The appendix includes a result on the nearly optimal estimation of the Cholesky factors, and proofs of the main results.
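For any real numbers $a$ and $b$, we denote $a \wedge b$ and $a \vee b$ as the minimum and maximum of $a$ and $b$, respectively. For any sequences $a_n$ and $b_n$, we denote $a_n = o(b_n)$ if $a_n / b_n \to 0$ as $n \to \infty$. We write $a_n = O(b_n)$, or $a_n \lesssim b_n$, if there exists a universal constant $C > 0$ such that $a_n \le C b_n$ for any $n$. We define vector $\ell_2$- and $\ell_\infty$-norms as $\|a\|_2 = (\sum_{j=1}^p a_j^2)^{1/2}$ and $\|a\|_\infty = \max_{1 \le j \le p} |a_j|$ for any $a = (a_1, \ldots, a_p)^T \in \mathbb{R}^p$. For a matrix $A$, the matrix $\ell_\infty$-norm is defined as $\|A\|_\infty = \sup_{\|x\|_\infty = 1} \|Ax\|_\infty$. We denote $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ as the minimum and maximum eigenvalues of $A$, respectively.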
2.2 Gaussian DAG Models
When the random variables have a natural ordering, one common approach for the estimation of high-dimensional covariance (or precision) matrices is to adopt banded structures. One popular model to incorporate banded structure is a Gaussian directed acyclic graph (DAG) model in which the bandwidth can be encoded by the Cholesky factor via the modified Cholesky decomposition (MCD) below.
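A directed graph $\mathcal{D} = (V, E)$ consists of vertices $V = \{1, \ldots, p\}$ and directed edges $E$. For any $i, j \in V$, we denote $(i, j) \in E$ as a directed edge $i \to j$ and call $i$ a parent of $j$. A DAG is a directed graph with no directed cycle. In this paper, we assume a parent ordering is known, where $i < j$ holds for any parent $i$ of $j$ in a DAG $\mathcal{D}$, which has been commonly used in the literature. A multivariate normal distribution $N_p(0, \Sigma)$ is said to be a Gaussian DAG model over $\mathcal{D}$, if $Y = (Y_1, \ldots, Y_p)^T \sim N_p(0, \Sigma)$ satisfies
$$Y_j \perp \{Y_{j'}\}_{j' < j,\; j' \notin pa_j} \mid Y_{pa_j}$$
for any $j \in V$, where $pa_j$ is the set of all parents of $j$.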
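We consider a Gaussian DAG model over some graph $\mathcal{D}$,
$$X_1, \ldots, X_n \mid \Omega_n \overset{iid}{\sim} N_p(0, \Omega_n^{-1}), \qquad (1)$$
where $\Omega_n = \Sigma_n^{-1}$ is a $p \times p$ precision matrix and $X_i = (X_{i1}, \ldots, X_{ip})^T \in \mathbb{R}^p$ for all $i = 1, \ldots, n$. For any positive definite matrix $\Omega_n$, there exist a unique lower triangular matrix $A_n = (a_{jl})$ and a unique diagonal matrix $D_n = \mathrm{diag}(d_j)$ such that
$$\Omega_n = (I_p - A_n)^T D_n^{-1} (I_p - A_n), \qquad (2)$$
where $a_{jj} = 0$ and $d_j > 0$ for all $j = 1, \ldots, p$, by the MCD. We call $A_n$ the Cholesky factor. It can be easily shown that $a_{jl} \neq 0$ if and only if $l$ is a parent of $j$, so the Cholesky factor $A_n$ uniquely determines a DAG $\mathcal{D}$. Define $k$ as the bandwidth of a matrix if the off-diagonal elements of the matrix farther than $k$ from the diagonal are all zero. If the bandwidth of the Cholesky factor $A_n$ is $k$, model (1) can be represented as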
for all , where , and
. The above representation enables us to adopt priors and techniques in the linear regression literature.
We are interested in the consistent estimation and hypothesis test of the bandwidth of the precision matrix. From the decomposition (2), the bandwidth of is if and only if the bandwidth of is . Thus, we can infer the bandwidth of the precision matrix by inferring that of the Cholesky factor.
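The link between the two bandwidths can be checked numerically. The following is a minimal sketch, assuming the MCD takes the common form Ω = (I − A)ᵀ D⁻¹ (I − A) with A strictly lower triangular and D diagonal; the function names `modified_cholesky` and `bandwidth`, the tolerance, and the toy values are illustrative choices and not part of the paper.

```python
import numpy as np

def modified_cholesky(omega):
    """Return (A, d) with omega = (I - A)^T diag(1/d) (I - A).

    A is strictly lower triangular; d is the positive diagonal of D.
    Assumes omega is symmetric positive definite.
    """
    p = omega.shape[0]
    sigma = np.linalg.inv(omega)        # covariance matrix
    chol = np.linalg.cholesky(sigma)    # sigma = chol @ chol.T, chol lower triangular
    scale = np.diag(chol)
    unit_lower = chol / scale           # divide column j by chol[j, j]
    d = scale ** 2                      # sigma = L diag(d) L^T with L unit lower triangular
    A = np.eye(p) - np.linalg.inv(unit_lower)
    return A, d

def bandwidth(M, tol=1e-8):
    """Largest |i - j| over entries with |M[i, j]| > tol (0 for a diagonal matrix)."""
    rows, cols = np.nonzero(np.abs(M) > tol)
    return int(np.max(np.abs(rows - cols), initial=0))

# Toy example: a Cholesky factor with bandwidth 2 (hypothetical values).
p, k = 6, 2
A_true = np.zeros((p, p))
for j in range(p):
    for l in range(max(0, j - k), j):
        A_true[j, l] = 0.4
d_true = np.linspace(1.0, 2.0, p)

T = np.eye(p) - A_true
omega = T.T @ np.diag(1.0 / d_true) @ T

A_hat, d_hat = modified_cholesky(omega)
print(bandwidth(A_hat), bandwidth(omega))
```

In this toy example the recovered Cholesky factor and the precision matrix share the same bandwidth, which is the equivalence used above to reduce inference on the precision matrix to inference on the Cholesky factor.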
2.3 Prior Distribution
Let and be sub-matrices consisting of th and th columns of , respectively. We suggest the following prior distribution
for some positive constants and positive sequence , where . The conditional prior distribution for is a version of Zellner’s g-prior (Zellner, 1986; Martin et al., 2017) in the linear regression literature. Note that model (3) is equivalent to . Due to the conjugacy, it enables us to calculate the posterior distribution in a closed form up to some normalizing constant. The prior for is carefully chosen to reduce the posterior mass towards large bandwidth . We emphasize here that one can use the usual non-informative prior , but then the conditions required for the main results in Section 3 must be changed. This issue will be discussed in more detail below. We assume the prior to have the support on . We will introduce condition (A4) for and the hyperparameters in Section 2.4, and show that is enough to establish the main results in Section 3. The priors (4)–(6) lead to the following joint posterior distribution,
provided that , where and . The marginal posterior consists of two parts: the penalty on the model size,
, and the estimated residual variances,. Thus, priors (4) and (5) naturally impose the penalty term for the marginal posterior .
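The two-part structure described above (a penalty on the model size plus estimated residual variances) resembles a penalized-likelihood criterion. The sketch below illustrates that structure only: it swaps in a generic BIC-type score for the marginal posterior (7), whose exact form depends on the prior hyperparameters and is not reproduced here. The function names, the penalty, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def residual_variances(X, k):
    """Residual variance of regressing each column on its (at most) k predecessors."""
    n, p = X.shape
    d_hat = np.empty(p)
    for j in range(p):
        kj = min(k, j)                      # early columns have fewer than k predecessors
        y = X[:, j]
        if kj == 0:
            resid = y
        else:
            Z = X[:, j - kj:j]
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            resid = y - Z @ coef
        d_hat[j] = resid @ resid / n
    return d_hat

def bic_score(X, k):
    """BIC-type stand-in: Gaussian log-likelihood term minus a model-size penalty."""
    n, p = X.shape
    n_coef = sum(min(k, j) for j in range(p))
    fit = -0.5 * n * np.sum(np.log(residual_variances(X, k)))
    return fit - 0.5 * np.log(n) * n_coef

def select_bandwidth(X, k_max):
    """Return the candidate bandwidth in 0..k_max with the largest score."""
    scores = np.array([bic_score(X, k) for k in range(k_max + 1)])
    return int(np.argmax(scores)), scores

# Simulated data from a banded regression structure with bandwidth 2 (toy values).
rng = np.random.default_rng(0)
n, p, k_true = 500, 10, 2
X = np.zeros((n, p))
for j in range(p):
    kj = min(k_true, j)
    X[:, j] = 0.4 * X[:, j - kj:j].sum(axis=1) + rng.standard_normal(n)

k_hat, scores = select_bandwidth(X, k_max=5)
print(k_hat)
```

A larger bandwidth always lowers the residual variances, so the penalty term is what prevents the score from favoring the largest candidate; the prior on the residual variances plays the analogous role in the Bayesian procedure.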
The effect of prior appears in marginal posterior for . Compared with the prior , it produces the term instead of . Thus, it reduces the posterior mass towards large bandwidth since decreases as grows. We conjecture that, at least for our prior choice of with a constant , this power adjustment of is essential to prove the selection consistency for . Suppose we use the prior . Similar to the proof of Theorem 3.1, to obtain the selection consistency, we will use the inequality
and show that the expectation of the right-hand side term converges to zero for any as , where is the true bandwidth. Note that unless shrinks to zero, the inequality causes only a constant multiplication. The most important task is dealing with the last term in (8), . Concentration inequalities for chi-square random variables (for example, see Lemma 3 in Yang et al. (2016) and Lemma 4 in Shin et al. (2015)) suggest an upper bound with high probability for any , and some constant . In this case, the hyperparameter should be of order for some constant to make the right-hand side in (8) converge to zero. Then, with the choice , condition (A2), which will be introduced in Section 2.4, should be modified by replacing with to achieve the selection consistency. In summary, the main results in this paper still hold for the prior , but they require stronger conditions for technical reasons. We state the results using prior (5) to emphasize that the bandwidth selection problem requires a weaker condition than the usual model selection problem.
If we adopt the fractional likelihood (Martin et al., 2017), we can achieve the selection consistency (Theorem 3.1) with the prior instead of (5) under conditions similar to those in Theorem 3.1. However, with the fractional likelihood, we cannot calculate the Bayes factor, which is essential to describe the Bayesian test results in Sections 3.2 and 3.3.
Throughout the paper, we consider the high-dimensional setting where the number of variables $p = p_n$ increases to infinity as $n \to \infty$.
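We denote $\Omega_{0n}$ as the true precision matrix whose MCD is given by $\Omega_{0n} = (I_p - A_{0n})^T D_{0n}^{-1} (I_p - A_{0n})$.
Let $P_0$ and $E_0$ be the probability measure and expectation corresponding to model (1) with $\Omega_n = \Omega_{0n}$.
For the true Cholesky factor $A_{0n}$, we denote $k_0$ as the true bandwidth.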
We introduce conditions (A1)–(A4) for the true precision matrix and priors (4)–(6):
(A1). There exist positive sequences and such that for every and .
(A2). For a given positive constant in prior (5), there exists a positive constant such that for every ,
where and .
Now, let us describe the above conditions in more detail. The bounded eigenvalue condition for the true precision matrix is common in the high-dimensional precision matrix literature (Banerjee and Ghosal, 2014, 2015; Xiang et al., 2015; Ren et al., 2015). We allow that and as , so condition (A1) is much weaker than the condition in the above literature, which assumes for some small constant . Cao et al. (2017) also allowed diverging bounds, but assumed that and for some . If we assume that , then condition (A1) implies that , which is much weaker than the condition used in Cao et al. (2017).
Condition (A2) is called the beta-min condition. If we assume that and , in our model it only requires the lower bound of the nonzero elements to be of order . In the sparse regression coefficient literature, the lower bound of the nonzero coefficients is usually assumed to be up to some constant (Castillo et al., 2015; Yang et al., 2016; Martin et al., 2017). Here, the term can be interpreted as a price coming from the absence of information on the zero-pattern. Condition (A2) reveals the fact that, under the banded assumption, we do not need to pay this price anymore.
Condition (A3) ensures that the true bandwidth lies in the support of . The condition is needed for the selection consistency, which holds if we choose for some large constant . Although this is slightly stronger than the condition in An et al. (2014), it is much weaker than those in other works. For example, Banerjee and Ghosal (2014) assumed for the consistent estimation of the precision matrix, and Cheng et al. (2017) assumed for theoretical properties.
with , it satisfies the conditions. Furthermore, if we choose , which leads to
the conditions are met if and .
In the sparse linear regression literature, a common choice for the prior on the unknown sparsity is for some constant . See Castillo et al. (2015), Yang et al. (2016) and Martin et al. (2017). If we adopt this type of prior for the bandwidth selection problem, a naive approach is to use for each row of the Cholesky factor: it results in . To obtain the strong model selection consistency, in this case, in condition (A2) has to be for some constant . Thus, it unnecessarily requires a stronger beta-min condition, which can be avoided by using like (11) or (12).
3 Main Results
3.1 Bandwidth Selection Consistency
When there is a natural ordering in the data set, estimating the bandwidth of the precision matrix is important for detecting the dependence structure. It is a crucial first step for the subsequent analysis. In this subsection, we show the bandwidth selection consistency of the proposed prior. Theorem 3.1 states that the posterior distribution puts a mass tending to one at the true bandwidth . Thus, we can detect the true bandwidth using the marginal posterior distribution for the bandwidth . We call this property the bandwidth selection consistency.
Informed readers might be aware of the recent work of Cao et al. (2017) considering the sparse DAG models. It should be noted that their method is not applicable to the bandwidth selection problem. The key issue is that their method is not adaptive to the unknown sparsity corresponding to the true bandwidth in this paper: to obtain the selection consistency, the choice of hyperparameter should depend on , which is unknown and of interest. Furthermore, they required stronger conditions in terms of dimensionality , true sparsity , eigenvalues of the true precision matrix and beta-min for the strong model selection consistency.
The bandwidth selection result does not necessarily imply the consistency of the Bayes factor. Note that prior (4), and
and priors (4), (5) and lead to the same marginal posterior for . Thus, the above priors also achieve the bandwidth selection consistency in Theorem 3.1. However, (13) might be inappropriate when the Bayes factor is of interest, because the ratio of normalizing terms induced by prior (13) ( and in (14)) has a non-ignorable effect on the Bayes factor.
3.2 Consistency of One-Sample Bandwidth Test
In this subsection, we focus on constructing a Bayesian bandwidth test for the testing problem versus for some given . A Bayesian hypothesis test is based on the Bayes factor defined by the ratio of marginal likelihoods,
We are interested in the consistency of the Bayes factor, which is one of the most important asymptotic properties of the Bayes factor (Dass and Lee, 2004). A Bayes factor $B_{10}$ is said to be consistent if $B_{10}$ converges to zero in probability under the true null hypothesis $H_0$ and $B_{10}^{-1}$ converges to zero in probability under the true alternative hypothesis $H_1$.
Although the Bayes factor plays a crucial role in Bayesian variable selection, its asymptotic behavior in the high-dimensional setting is not well understood (Moreno et al., 2010). Only a few works have studied the consistency of the Bayes factor in high-dimensional settings (Moreno et al., 2010; Wang and Sun, 2014; Wang et al., 2016), and they focused only on the pairwise consistency of the Bayes factor. They considered the testing problem versus for any , where is the number of nonzero elements of the linear regression coefficient. Note that a Bayes factor is said to be pairwise consistent if the Bayes factor is consistent for any pair of simple hypotheses and .
We focus on the composite hypotheses and rather than simple hypotheses. To conduct a Bayesian hypothesis test, prior distributions for both hypotheses should be determined. Denote the prior under the hypothesis as for . Since the difference between two hypotheses comes only from the bandwidth, we will use the same conditional priors for and given , i.e. for , where is chosen as (4) and (5). We suggest using priors and such that
where and . Then, the Bayes factor has the following analytic form,
where the marginal posterior is given in (7) up to some normalizing constant. Note that the Bayes factor is well defined because both hypotheses have the same improper priors on . We will show that the Bayes factor is consistent for any composite hypotheses and , which is generally stronger than the pairwise consistency of the Bayes factor. If we assume that for any and , then one can see that the consistency of the Bayes factor for hypotheses and for any implies the pairwise consistency of the Bayes factor for any pair of simple hypotheses and for .
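When the priors under the two hypotheses are restrictions of a single prior on the bandwidth to each hypothesis, the Bayes factor for composite hypotheses equals the posterior odds divided by the prior odds. The sketch below assumes that construction and assumes the hypotheses split the candidate bandwidths at a threshold (here written as k ≤ k* versus k > k*); the function name and the toy numbers are hypothetical. It works with unnormalized log values, since the marginal posterior is only known up to a normalizing constant.

```python
import numpy as np

def _logsumexp(v):
    """Numerically stable log(sum(exp(v))) for a non-empty 1-D array."""
    v = np.asarray(v, dtype=float)
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def log_bayes_factor_10(log_post, log_prior, k_star):
    """log B10 for H0: k <= k_star versus H1: k > k_star.

    log_post[k] and log_prior[k] are (possibly unnormalized) log marginal
    posterior and log prior values for the candidate bandwidths k = 0, ..., R.
    """
    log_post = np.asarray(log_post, dtype=float)
    log_prior = np.asarray(log_prior, dtype=float)
    if not 0 <= k_star < len(log_post) - 1:
        raise ValueError("both hypotheses must contain at least one bandwidth")
    log_post_odds = _logsumexp(log_post[k_star + 1:]) - _logsumexp(log_post[:k_star + 1])
    log_prior_odds = _logsumexp(log_prior[k_star + 1:]) - _logsumexp(log_prior[:k_star + 1])
    return log_post_odds - log_prior_odds

# Toy posterior over k = 0, ..., 4 with a uniform prior (illustrative numbers only).
post = np.array([0.05, 0.10, 0.70, 0.10, 0.05])
log_b10 = log_bayes_factor_10(np.log(post), np.zeros(5), k_star=1)
print(np.exp(log_b10))
```

Dividing by the prior odds removes the effect of how much prior mass each hypothesis receives, so the result depends on the data only through the marginal likelihoods.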
For given positive constants , and and integers , and , define
where and are defined in condition (A4). Theorem 3.2 shows the convergence rates of Bayes factors under each hypothesis. It turns out that is sufficient for the consistency of the Bayes factor.
Note that if we use prior (11) with , the effect of the prior, , can dominate the posterior ratio, in the Bayes factor. Because prior knowledge on the bandwidth is usually insufficient, this is clearly undesirable. Moreover, the direction of the effect is the opposite of the prior knowledge.
An et al. (2014) and Cheng et al. (2017) developed frequentist bandwidth tests for the hypotheses versus and showed that their test statistic is asymptotically normal under the null and has a power converging to one as . Compared with the result in Theorem 3.2, Cheng et al. (2017) required the upper bound for the true bandwidth , which is much stronger than our condition (A3). An et al. (2014) allowed , but assumed that the partial correlation coefficient between and given is of order . It implies that converges to zero at some rate. Thus, the nonzero elements , should converge to zero, which is somewhat unnatural.
Johnson and Rossell (2010, 2012) and Rossell and Rubio (2017) pointed out that the use of a local alternative prior leads to imbalanced convergence rates for the Bayes factors, and showed that this issue can be avoided by using non-local alternative priors. Interestingly, however, the convergence rates for the Bayes factors in Theorem 3.2 are of similar order under both hypotheses without using a non-local prior. Roughly speaking, the imbalance issue can be ameliorated by introducing the beta-min condition (condition (A2)). To simplify the situation, consider the model
where , , , and . Suppose priors (4) and (5) are imposed on and given . Consider hypotheses and , where , and assume that the eigenvalues of are bounded and as for simplicity. Note that the prior for is a local alternative prior because on for some constant . If is true, decreases at rate for some constant based on techniques in the proof of Theorem 3.1. On the other hand, if is true, decreases exponentially with , where is the lower bound for the absolute values of the nonzero elements of . Johnson and Rossell (2010, 2012) and Rossell and Rubio (2017) assumed that for some constant . In that case, decreases exponentially with , which causes the imbalanced convergence rates. However, if we assume similar to condition (A2), decreases at rate for some constant . Thus, convergence rates for the Bayes factors have similar order under both hypotheses.
The above argument does not mean that non-local priors are not useful for our problem. We note that the balance in convergence rates obtained from the beta-min condition differs from that obtained from a non-local prior. The former makes the rate of slower under , while the latter makes the rate of faster under . Thus, the use of non-local priors might improve the rates of convergence for under in Theorem 3.2. However, it would increase the computational burden, and it is unclear which rate one can achieve using the non-local prior under condition (A2), so we leave it as future work.
3.3 Consistency of Two-Sample Bandwidth Test
Suppose we have two data sets from the models
where and are the MCDs. Denote the bandwidth of as for . In this subsection, our interest is the test of equality between two bandwidths and , the two-sample bandwidth test. We consider the hypothesis testing problem versus and investigate the asymptotic behavior of the Bayes factor,
where and .
Denote the priors under and as, respectively
We suggest the following conditional priors and for any given and ,
where , is the nonzero elements in the th row of and for . Similar to the previous notations, we denote