 # Ensemble Estimation of Large Sparse Covariance Matrix Based on Modified Cholesky Decomposition

Estimation of large sparse covariance matrices is of great importance for statistical analysis, especially in the high dimensional setting. The traditional approach such as sample covariance matrix could perform poorly due to the high dimensionality. In this work, we propose a positive-definite estimator for the covariance matrix based on the modified Cholesky decomposition. The modified Cholesky decomposition relies on the order of variables, which provides the flexibility to obtain a set of covariance matrix estimates under different orders of variables. The proposed method considers an ensemble estimator as the "center" of such a set of covariance matrix estimates with respect to the Frobenius norm. The proposed estimator is not only guaranteed to be positive definite, but also can capture the underlying sparse structure of the covariance matrix. Under some weak regularity conditions, both algorithmic convergence and asymptotical convergence are established. The merits of the proposed method are illustrated through simulation studies and one real data example.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

High-dimensional data are widely occurred in modern scientific applications, such as bioinformatics, imaging recognition, weather forecasting, and financial service. Estimation of large covariance matrix from the high-dimensional data is thus an important and challenge problem in the multivariate data analysis. For example, dimension reduction using principal component analysis usually relies on accurate estimation of covariance matrix. Under the context of graphical models, the estimation of covariance matrix or its inverse is often used to infer the network structure of the graph. However, conventional estimation of covariance matrix is known to perform poorly due to the high dimension problems when the number of variables is close to or larger than the sample size (Johnstone, 2001). To overcome the curse of dimensionality, a variety of methods proposed in literature often assumes certain patterns of sparsity for the covariance matrices.

In this work, our focus is on the estimation of sparse covariance matrix for high-dimensional data. Early work on covariance matrix estimation includes shrinking eigenvalues of the sample covariance matrix (Dey and Srinivasan, 1985; Haff, 1991), a linear combination of the sample covariance and a proper diagonal matrix (Ledoit and Wolf, 2004), improving the estimation based on matrix condition number (Aubry et al., 2012; Won et al., 2013), and regularizing the eigenvectors of the matrix logarithm of the covariance matrix (Deng and Tsui, 2013). However, the above mentioned methods do not explore the sparse structure of the covariance matrix. A sparse covariance matrix estimate can be useful for subsequent inference, such as inferring the correlation pattern among the variables. Bickel and Levina (2008) proposed to threshold the small entries of the sample covariance matrix to zeroes and studied its theoretical behavior when the number of variables is large. Rothman, Levina and Zhu (2009) considered to threshold the sample covariance matrix with more general thresholding functions. Cai and Yuan (2012) proposed a covariance matrix estimation through block thresholding. Their estimator is constructed by dividing the sample covariance matrix into blocks and then simultaneously estimating the entries in a block by thresholding. However, the threshold-based estimator is not guaranteed to be positive definite. To make the estimate being sparse and positive-definite, Bien and Tibshirani (2011) considered a penalized likelihood method with a Lasso penalty (Tibshirani, 1996) on the entries of the covariance matrix. Their idea is similar to the Graphical Lasso for inverse covariance matrix estimation in the literature (Yuan and Lin, 2007; Friedman, Hastie, and Tibshirani, 2008; Rocha, Zhao, and Yu, 2008; Rothman et al., 2008; Yuan, 2008; Deng and Yuan, 2009; and Yuan, 2010), but the computation is much more complicated due to the non-convexity of the objective function. Xue, Ma and Zou (2012) developed a sparse covariance matrix estimator for high-dimensional data based on a convex objective function with positive definite constraint and

penalty. They also derived a fast algorithm to solve the constraint optimization problem. Some other work on the estimation of high-dimensional covariance matrix can be found in Fan, Liao and Mincheva (2013), Liu, Wang and Zhao (2014), Xiao et al. (2016), Cai, Ren and Zhou (2016). A comprehensive review of recent development of covariance matrix estimation can be found in Pourahamdi (2013) and Fan, Liao and Liu (2016).

Another direction of sparse covariance matrix estimation is to take advantage of matrix decomposition. One popular and effective decomposition is the modified Cholesky decomposition (MCD) (Pourahmadi, 1999; Wu and Pourahmadi, 2003; Pourahmadi, Daniels and Park, 2007; Rothman, Levina and Zhu, 2009; Dellaportas and Pourahmadi, 2012; Xue, Ma and Zou, 2012; Rajaratnam and Salzman, 2013). It assumes that the variables have a natural order based on which the variables can be sequentially orthogonalized to re-parameterize the covariance matrix. By imposing certain sparse structures on the Cholesky factor, it can result in certain sparse structure on the estimated covariance matrix. For example, Wu and Pourahmadi (2003) proposed a banded estimator of Cholesky factor, which can be obtained by regressing each variable only on its closest predecessors. Bickel and Levina (2008) showed that banding the Cholesky factor can produce a consistent estimator in the operator norm under weak conditions on the covariance matrix. Huang et al. (2006) considered to impose an (Lasso) penalty on the entries of the Cholesky factor for estimating the sparse covariance matrix. However, the MCD-based approach for estimating covariance matrix depends on the order of variables . Such an pre-specification on the order of variables may not hold in practice. A natural order of variables is often not available or cannot be pre-determined before in many applications such as the gene expression data and stock marketing data.

In this paper, we adopt the MCD approach for estimating the large covariance matrix, but alleviate the drawback of order dependency of the MCD method. Different from most existing MCD-based approaches, the proposed method takes advantage of the fact that one can always obtain a MCD-based covariance matrix estimate for a given order of variables. By considering a set of covariance matrix estimates under different orders of variables, we can obtain an ensemble estimator for the covariance matrix with positive-definite and sparse properties. Specifically, using the permutation idea, we first obtain a number of estimates of covariance matrix based on different orders of variables used in the MCD approach. With such a set of estimates, the proposed ensemble estimator is obtained as the “center” of them under the Frobenius norm through an penalized objective function. The regularization is imposed to achieve the sparsity of the estimate. Such an estimator takes advantage of multiple orders of variables due to the ensemble effort. An efficient algorithm is also developed to make the computation attractive for solving the estimator. The numerical convergence of this algorithm is guaranteed with theoretical justification. Furthermore, the consistent properties of the proposed estimator are also established under Frobenius norm with some regularity conditions. The simulation and real-data analysis elaborate that the proposed method performs much better than several existing approaches in terms of both estimation accuracy and sparsity accuracy.

The remainder of this work is organized as follows. Section 2 briefly reviews the MCD approach to estimate the covariance matrix. Section 3 introduces the proposed method by addressing the order issue. An efficient algorithm is also developed to solve the objective function. In Section 4, the theoretical properties are presented. The simulation study and one real data example are reported in Section 5 and 6, respectively. We conclude the paper in Section 7.

## 2 The Modified Cholesky Decomposition

Without loss of generality, suppose that is a

-dimensional random vector with mean

and covariance matrix . Let be independent and identically distributed observations following . Pourahmadi (1999) proposed the modified Cholesky decomposition (MCD) for the estimation of a covariance matrix, which is statistically meaningful and grantees the positive definiteness of the estimate. This decomposition arises from regressing each variable on its predecessors for . Specifically, consider to fit a series of regressions

 Xj=j−1∑k=1(−tjk)Xk+ϵj=^Xj+ϵj,

where is the error term for the th regression with and . Let and be the diagonal covariance matrix of . Construct the unit lower triangular matrix with ones on its diagonal and regression coefficients as its th row. Then one can have

 \boldmathD\unboldmath=Var(\boldmathϵ% \unboldmath)=Var(\boldmathX\unboldmath−^\boldmathX% \unboldmath)=Var(\boldmathT\unboldmath\boldmathX\unboldmath% )=\boldmathT\unboldmath\boldmathΣ\unboldmath% \boldmathT\unboldmath′,

and thus

 \boldmathΣ\unboldmath=\boldmathT\unboldmath% −1\boldmathD\unboldmath\boldmathT\unboldmath′−1. (2.1)

The MCD approach reduces the challenge of modeling a covariance matrix into the task of modeling regression problems, and is applicable in high dimensions. However, directly imposing the sparse structure on Cholesky factor matrix in (2.1) does not imply the sparse pattern of covariance matrix since it requires an inverse of . Thus the formulation (2.1) is not convenient to impose a sparse structure on the estimation of . Alternatively, one can consider a latent variable regression model based on the MCD. Writing would lead to

 Var(\boldmathX\unboldmath) =Var(\boldmathL\unboldmath\boldmathϵ% \unboldmath) Σ =\boldmathL\unboldmath\boldmathD\unboldmath\boldmathL\unboldmath′. (2.2)

This decomposition can be interpreted as resulting from a new sequence of regressions, where each variable is regressed on all the previous latent variable rather than themselves. It gives a sequence of regressions

 Xj=\boldmathl\unboldmathTj\boldmathϵ% \unboldmath=∑k

where is the th row of . Here and for .

With the data matrix , define its th column to be . Let denote the residuals of for , and . Let be the matrix containing the first residuals. Now we consider a sparse covariance matrix estimate by encouraging the sparsity on . One approach to achieving such sparsity is to use the Lasso for the regression (Tibshirani, 1996)

 ^\boldmathl\unboldmathj=argmin% \boldmathl\unboldmathj∥\boldmathx\unboldmath(j)−Z(j)\boldmathl\unboldmathj∥22+ηj∥\boldmathl\unboldmathj∥1, j=2,…,p, (2.4)

where is a tuning parameter and selected by cross validation. stands for the vector norm. is used to construct the residuals for the last column of . Then

is estimated as the sample variance of

 ^d2j=ˆVar(^\boldmathe\unboldmath(j))=ˆVar(\boldmathx\unboldmath(j)−Z(j)^\boldmathl\unboldmathj) (2.5)

when constructing matrix . Hence, will be a sparse covariance matrix estimate.

## 3 The Proposed Method

Clearly, the estimate

depends on the order of random variables

. It means that different orders would lead to different sparse estimates of . To address this order-dependent issue, we consider an ensemble estimation of by using the idea of permutation. Specifically, we generate different permutations of as the orders of the random variables, denoted by , . Let be the corresponding permutation matrix. Under a variable order , the estimate is obtained as

 ^\boldmathΣ\unboldmathπk=^% \boldmathL\unboldmathπk^\boldmathD\unboldmathπk^\boldmathL\unboldmath′πk, (3.1)

where and are calculated based on (2.4) and (2.5). Then transforming back to the original order, we have

 ^\boldmathΣ\unboldmathk=\boldmathP% \unboldmathπk^\boldmathΣ\unboldmathπk%\boldmath$P$\unboldmath′πk. (3.2)

To obtain an ensemble estimator for , a naive solution would be . However, such an estimate may not be sparse since the sparse structure in is destroyed by the average.

In order to simultaneously achieve the positive definiteness and sparsity for the estimator, we propose to consider the estimate

 ^\boldmathΣ\unboldmath=argmin% \boldmathΣ\unboldmath⪰ν\boldmathI\unboldmath12MM∑k=1∥\boldmathΣ\unboldmath−^\boldmathΣ\unboldmathk∥2F+λ|\boldmathΣ\unboldmath|1, (3.3)

where stands for the Frobenius norm, is a tuning parameter, and is norm for all the off-diagonal elements. Here is some positive arbitrarily small number. The constraint is to guarantee the positive-definiteness of the estimate. The penalty term is to encourage the sparse pattern in . It is worth pointing out that, if in (3.3), the solution of would be ; if without the constraint in (3.3), the solution of would be the soft-threshold estimate of . Therefore, the proposed estimate is to pursue the “center” of the multiple estimates , while maintain the properties of being positive-definite and sparse.

To efficiently solve the optimization in (3.3), we employ the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). The ADMM technique has been widely used in solving the convex optimization under the content of penalized covariance matrix estimation (Xue, Ma and Zou, 2012). The ADMM technique does not require the differentiability assumption of the objective function and it is easy to implement. Specifically, let us first introduce a new variable and an equality constraint as follows

 (^\boldmathΣ\unboldmath,^\boldmathΦ\unboldmath)=argmin\boldmathΣ\unboldmath,% \boldmathΦ\unboldmath{12MM∑k=1∥\boldmathΣ\unboldmath−^\boldmathΣ\unboldmathk∥2F+λ|\boldmathΣ\unboldmath|1:\boldmathΣ% \unboldmath=\boldmathΦ\unboldmath,\boldmathΦ% \unboldmath⪰ν\boldmathI\unboldmath}. (3.4)

Note that the solution of (3.4) gives solution to (3.3). To solve (3.4), we minimize its augmented Lagrangian function for some given penalty parameter as

 L(\boldmathΣ\unboldmath,\boldmathΦ% \unboldmath;\boldmathΛ\unboldmath)=12MM∑k=1∥\boldmathΣ\unboldmath−^\boldmathΣ\unboldmathk∥2F+λ|\boldmathΣ\unboldmath|1−⟨% \boldmathΛ\unboldmath,\boldmathΦ\unboldmath−% \boldmathΣ\unboldmath⟩+12τ∥\boldmathΦ% \unboldmath−\boldmathΣ\unboldmath∥2F, (3.5)

where is the Lagrangian multiplier. The notation stands for the matrix inner product as , where and are the elements of matrices and . The ADMM iteratively solves the following steps sequentially for till convergence

 \boldmathΦ\unboldmath step:\boldmathΦ\unboldmathi+1=argmin\boldmathΦ\unboldmath⪰ν\boldmathI\unboldmathL(\boldmathΣ\unboldmathi,%\boldmath$Φ$\unboldmath;\boldmathΛ\unboldmathi) (3.6)
 \boldmathΣ\unboldmath step:\boldmath% Σ\unboldmathi+1=argmin\boldmathΣ\unboldmathL(\boldmathΣ\unboldmath,\boldmathΦ\unboldmathi+1;\boldmathΛ\unboldmathi) (3.7)
 \boldmathΛ\unboldmath step:% \boldmathΛ\unboldmathi+1=\boldmathΛ\unboldmathi−1τ(\boldmathΦ\unboldmathi+1−\boldmathΣ% \unboldmathi+1). (3.8)

Assume the eigenvalue decomposition of a matrix is , and define . Then we develop the closed form for (3.6) as

 ∂L(\boldmathΣ\unboldmathi,% \boldmathΦ\unboldmath;\boldmathΛ\unboldmathi)∂\boldmathΦ\unboldmath =−\boldmathΛ\unboldmathi+1τ(% \boldmathΦ\unboldmath−\boldmathΣ\unboldmathi)≜0 Φ =\boldmathΣ\unboldmathi+τ\boldmathΛ\unboldmathi \boldmathΦ\unboldmathi+1 =(\boldmathΣ\unboldmathi+τ\boldmathΛ\unboldmathi)+.

Next, define an element-wise soft threshold for each entry in matrix as with

 \boldmaths\unboldmath(zij,δ)=sign(zij)max(|zij|−δ,0)I{i≠j}+zijI{i=j}.

Then the solution of (3.7) is derived as

 ∂L(\boldmathΣ\unboldmath,% \boldmathΦ\unboldmathi+1;\boldmathΛ\unboldmathi)∂\boldmathΣ\unboldmath =1MM∑k=1(\boldmathΣ\unboldmath−^\boldmathΣ\unboldmathk)+\boldmathΛ% \unboldmathi+1τ(\boldmathΣ\unboldmath−% \boldmathΦ\unboldmathi+1)+λsign∗(\boldmathΣ\unboldmath)≜0 (τ+1)\boldmathΣ\unboldmath =τ(1MM∑k=1^\boldmathΣ% \unboldmathk−\boldmathΛ\unboldmathi)+\boldmathΦ\unboldmathi+1−λτsign∗(\boldmathΣ% \unboldmath) \boldmathΣ\unboldmathi+1 ={\boldmaths\unboldmath(τ(1MM∑k=1^\boldmathΣ\unboldmathk−\boldmathΛ% \unboldmathi)+\boldmathΦ\unboldmathi+1,λτ)}/(τ+1),

where means with the diagonal elements replaced by vector. Algorithm 1 summarizes the developed procedure for solving (3.3) by using the ADMM technique.

###### Algorithm 1.

Step 1: Input initial values , and .

Step 2: .

Step 3: .

Step 4: .

Step 5: Repeat Step 2 - 4 till convergence.

This algorithm converges fast and produces the optimal solution of in (3.5). In practice, the initial estimate is set to be the naive estimate . The

is set to be zero matrix, and

and (Xue, Ma and Zou, 2012). The optimal value of tuning parameter in (3.5) is chosen based on Bayesian information criterion (BIC) (Yuan and Lin, 2007)

 BIC(λ)=−log|^\boldmathΣ\unboldmath% −1λ|+tr[^\boldmathΣ\unboldmath−1λ\boldmathS\unboldmath]+lognn∑i≤j^eij(λ),

where is the sample covariance matrix, indicates the estimate of obtained by applying our algorithm with tuning parameter . if , and otherwise.

Note that the implementation of the proposed method also requires the choice of , the number of permutation orders. Obviously, the number of all possible permeation orders is , which increases rapidly as the number of variables increases. To get an appropriate value for for efficient computation, we have tried and as the potential number of random orders. The performances were quite comparable for larger than 30. Hence, we choose as the number of permutation orders for the proposed method in this paper. The more detailed discussion and justification on the choice of can be found in Kang and Deng (2017).

## 4 Theoretical Properties

In this section, Theorem 1 states that the sequence generated by Algorithm 1 from any starting point numerically converges to an optimal minimizer of (3.5), where is the optimal dual variable. Theorem 2 provides the asymptotic convergence rate of the estimator in (3.2) under the Frobenius norm. Theorem 3 and 4 demonstrate the asymptotic properties of the proposed estimator under some weak regularity conditions. The details of the proofs of Theorems 1- 4 are in the Appendix. To facilitate the presentation of the proofs, we first introduce some notations. Define a by matrix as

 \boldmathJ\unboldmath=(τ\boldmathI\unboldmathp×p00τ−1\boldmathI\unboldmathp×p).

Let the notation be and . Let be the true covariance matrix for the observations , and define the number of nonzero off-diagonal elements of as . Denote the maximal true variance in by . Let be the collection of nonzero elements in the lower triangular part of matrix . Denote by the maximum of the cardinality of for . Now we present the following lemma and theory.

###### Lemma 1.

Assume that is an optimal solution of (3.4) and is the corresponding optimal dual variable with the equality constraint , then the sequence generated by Algorithm 1 satisfies

 ∥\boldmathW\unboldmath+−\boldmathW% \unboldmathi∥2J−∥\boldmathW\unboldmath+−% \boldmathW\unboldmathi+1∥2J≥∥\boldmathW\unboldmathi−\boldmathW\unboldmathi+1∥2J,

where and .

###### Theorem 1.

Suppose are independent and identically distributed observations from . Then the sequence generated by Algorithm 1 from any starting point converges to an optimal minimizer of the objective function in (3.5).

Theorem 1 demonstrates the convergence of Algorithm 1. It automatically indicates that the sequence , produced by Algorithm 1 converges to an optimal solution of the objective (3.3).

In order to achieve the consistent property of in (3.2) obtained under each order using the MCD approach, one needs a basic assumption that there exists a constant

such that the singular values of the true covariance matrix are bounded as

 1/θ

where we use to indicate the singular values of matrix in a decreasing order. They are the square root of the eigenvalues of matrix . This assumption is also made in Rothman (2008), Lam and Fan (2009) and Guo et al. (2011). The assumption guarantees the positive definiteness property and makes inverting the covariance matrix meaningful. The following lemma and theorem give the convergence rate of in (3.2) under the Frobenius norm.

###### Lemma 2.

Let be the MCD of the true covariance matrix. Since the singular values of are bounded, there exist constants and such that , then there exist constants and such that

 h1

and

 h1
###### Lemma 3.

Let be the MCD of the true covariance matrix regarding a variable order . Under (4.1), assume the tuning parameters in (2.4) satisfy ; and satisfy , then and in (3.1) have the following consistent properties

 ∥^\boldmathL\unboldmathπk−\boldmathL\unboldmath0πk∥F=Op(√s1log(p)/n), ∥^\boldmathD\unboldmathπk−\boldmathD\unboldmath0πk∥F=Op(√plog(p)/n).
###### Theorem 2.

Assume that all the conditions in Lemma 3 hold. Then the estimator in (3.2) has the following consistent property

 ∥^\boldmathΣ\unboldmathk−\boldmathΣ\unboldmath0∥F=Op(√(s1+p)log(p)/n).

Theorem 2 establishes the consistent property of the estimate under variable order . From this result, the convergence rate of the proposed estimator for estimating a sparse covariance matrix can be obtained based on Frobenius norm by Theorem 3 and 4 as follows.

###### Theorem 3.

Assume all the conditions in Lemma 3 hold and . Under the exponential-tail condition that for all and

 E{exp(tx2ij)}≤K1.

For any , set

 λ=c20log pn+c1(log pn)1/2,

where

 c0=12eK1ρ1/2+ρ−1/2(m+1)

and

 c1=2K1(ρ−1+14ρσ2max) exp(12ρσmax) + 2ρ−1(m+2).

With probability at least

, we have

 ||^\boldmathΣ\unboldmath+−\boldmathΣ% \unboldmath0||F≤5λ(s0+p)1/2.
###### Theorem 4.

Assume all the conditions in Lemma 3 hold and for some . Under the polynomial-tail condition that for all and

 E{|xij|4(1+γ+ε)}≤K2.

For any , set

 λ=8(K2+1)(m+1)logpn+8(K2+1)(m+2)(logpn)1/2.

With probability at least , we have

 ||^\boldmathΣ\unboldmath+−\boldmathΣ% \unboldmath0||F≤5λ(s0+p)1/2.

## 5 Simulation Study

In this section, we conduct a comprehensive simulation study to evaluate the performance of the proposed method. Suppose that data

are generated independently from the normal distribution

. Here we consider the following four covariance matrix structures.

Model 1. = MA(0.5, 0.3). The diagonal elements are 1 with the first sub-diagonal elements 0.5 and the seconde sub-diagonal elements 0.3.

Model 2. = AR(0.5). The conditional covariance between any two random variables and is fixed to be , .

Model 3. is generated by randomly permuting rows and corresponding columns of .

Model 4. is generated by randomly permuting rows and corresponding columns of .

Note that Models 1-2 consider the banded or nearly-banded structure for the covariance matrix. While the sparse pattern of the covariance matrix in Models 3-4 is not structured due to the random permutation. For each case, we generate the data set with three settings of different sample sizes and variable sizes: (1) ; (2) and (3) .

The performance of the proposed estimator is examined in comparison with several other approaches, which are divided into three classes. The first class is the sample covariance matrix that serves as the benchmark. The second class is composed of three methods that deal with the variable order used in the MCD, including the MCD-based method with BIC order selection (BIC) (Dellaportas and Pourahmadi, 2012), the best permutation algorithm (BPA) (Rajaratnam and Salzman, 2013) and the proposed method (Proposed). The BPA selects the order of variables used in the MCD approach (2) such that is minimized. While the BIC method determines the order of variables in the MCD approach (2.3) in a forward selection fashion. That is, in each step, it selects a new variable having the smallest value of BIC when regressing it on its previous residuals. For example, suppose that is the candidate set of variables and there are variables already chosen and ordered. By regressing each , on the residuals , we can assign the variable corresponding to the minimum BIC value among the regressions to the th position of the order.

The third class of competing methods consists of four approaches that estimate the sparse covariance matrix directly without considering the variable order, including Bien and Tibshirani’s estimate (BT) (Bien and Tibshirani, 2011), Bickel and Levina’s estimate (BL) (Bickel and Levina, 2008), Xue, Ma and Zou’s estimate (XMZ) (Xue, Ma and Zou, 2012) and Rothman et al.’s estimate (RLZ) (Rothman et al., 2010). The BT estimate minimizes the negative log-likelihood function with penalty on the entries of the covariance matrix, that is,

 ^\boldmathΣ\unboldmathBT=argmin\boldmathΣ\unboldmath≻0{−log|\boldmathΣ\unboldmath−1|+tr(\boldmathΣ\unboldmath−1\boldmathS% \unboldmath)+η|\boldmathΣ\unboldmath|1},

where is the tuning parameter. The optimization is solved by using the majorization-minimization algorithm (Lange, Hunter and Yang, 2000). The BL estimate is obtained by imposing hard thresholding (Bickel and Levina, 2008) on the entries of the sample covariance matrix, so the resultant estimate may not be positive definite. The XMZ estimate is obtained from encouraging the sparsity on the sample covariance matrix as well as maintaining the property of positive definiteness. The resultant estimate is

 ^\boldmathΣ\unboldmathXMZ=argmin% \boldmathΣ\unboldmath⪰ν\boldmathI\unboldmath12∥\boldmathΣ\unboldmath−\boldmathS\unboldmath∥2F+η|\boldmathΣ\unboldmath|1.

The RLZ estimate is to introduce sparsity in the Cholesky factor matrix by estimating the first sub-diagonals of and setting the rest to zeroes. It means that each variable is only regressed on the previous residuals in (2.3). The tuning parameter can be chosen by AIC or cross validation.

To measure the accuracy of covariance matrix estimates obtained from each approach, we consider the entropy loss (EN), quadratic loss (QL), norm and mean absolute error (MAE), defined as follows:

 EN =tr[\boldmathΣ\unboldmath−1^% \boldmathΣ\unboldmath]−log|\boldmathΣ\unboldmath−1^\boldmathΣ\unboldmath|−p, QL =1p tr[^\boldmathΣ\unboldmath−1\boldmathΣ\unboldmath−\boldmathI\unboldmath]2, L1 norm =maxj∑i|^σij−σij|, MAE =1pp∑i=1p∑j=1|^σij−σij|.

Here we have not included the Kullback-Leibler loss (Kullback and Leibler, 1951), since it is more suitable in measuring inverse covariance matrix estimates rather than covariance matrix estimates (Levina et al., 2008). In addition, to gauge the performance of capturing sparse structure, we consider the false selection loss (FSL), which is the summation of false positive (FP) and false negative (FN). Here we say a FP occurs if a nonzero element in the true matrix is incorrectly estimated as a zero. Similarly, a FN occurs if a zero element in the true matrix is incorrectly identified as a nonzero. The FSL is computed in percentage as (FP + FN) /

. For each loss function above, Table

1 to Table 4

report the averages of the performance measures and their corresponding standard errors in the parentheses over 100 replicates. For each model, the two methods with lowest averages regarding each measure are shown in bold. Dashed lines in the tables represent the corresponding values not available due to matrix singularity.

For a short summary of Tables 1-4, the numerical results show that the proposed method generally provides a superior performance over other approaches in comparison. It is able to accurately catch the underlying sparse structure of the covariance matrix. When the underlying covariance matrix is banded or tapered, the proposed method is comparable to the RLZ method, and performs better than other approaches. Note that the RLZ method targets on the banded covariance matrix. When the underlying structure of covariance matrix is more general without any specification, the proposed method still performs well. As in the high-dimensional cases, the advantage of the proposed method is even more evident.

Specifically, Table 1 and 2 summarize the comparison results for Models 1 and 2, respectively. From the perspective of competing methods, the sample covariance matrix , serving as a benchmark approach, does not give the sparse structure and performs poorly under all the loss measures. The BIC and BPA in the second class of approaches provide sparse covariance matrix estimates compared with , but their false selection loss (FSL) are considerably larger than the proposed method. Moreover, the proposed method greatly outperforms the BIC and BPA regarding QL, and MAE for all settings of and 100. Although the proposed method is comparable to the BIC and BPA methods under EN criterion when , it performs slightly better when and much better in the case of .

For the BT method in the third class of approaches, the proposed method significantly outperforms it in capturing the sparse structure for the cases of and . Furthermore, the proposed method gives superior performance to the BT with respect to all the other loss criteria, especially QL measure. In comparison with the BL method, the proposed method performs similarly. It is known that the BL method is asymptotically optimal for sparse covariance matrix (Bickel and Levina, 2008). However, the estimated covariance matrix obtained from the BL method does not guarantee to be positive definite, which would result in matrix singularity in computing EN and QL losses. Compared with XMZ approach, the proposed method is superior or comparable with respect to all the loss measures except QL criterion in the settings of and 50. In the high dimensional case when