1 Introduction
We consider the estimation of the matrix inverse satisfying , given a data matrix . When the data matrix is a sample covariance matrix, our problem is the estimation of the inverse of the corresponding population covariance matrix. The inverse covariance matrix is also called the precision matrix or concentration matrix. With the dramatic advances in technology, the number of covariates is of greater order than the sample size in many statistical and engineering applications. In this case, the sample covariance matrix is singular, so the precision matrix cannot be obtained by direct inversion. A certain type of sparsity condition is then required for proper estimation of the precision matrix and for theoretical investigation of the estimation problem. In this paper, we impose for simplicity a maximum-degree sparsity condition on the target inverse matrix .
Many approaches have been proposed to estimate a sparse inverse matrix in the high-dimensional setting. Penalization is one of the most popular methods. Lasso-type methods, i.e. convex minimization algorithms with the penalty on all entries of , have been discussed by Banerjee, El Ghaoui and d'Aspremont (2008), Friedman, Hastie and Tibshirani (2008) and others, and by Yuan and Lin (2007) with penalization on the off-diagonal entries only. The latter is referred to as the graphical Lasso (GLasso), due to the connection of the precision matrix to Gaussian Markov graphical models. In this GLasso framework, Rothman, Bickel, Levina and Zhu (2008) proved the convergence rate in the Frobenius norm and in the spectrum norm, where is the number of nonzero entries in the off-diagonal matrix. Ravikumar, Wainwright, Raskutti and Yu (2008) provided sufficient conditions for model selection consistency of this regularized MLE. Lam and Fan (2009) studied a general penalty function and achieved a sharper bound of order under the Frobenius norm for the penalty. Since the spectrum norm can be controlled via the Frobenius norm, this provides a sufficient condition for convergence under the spectrum norm to the unknown precision matrix. It is, however, a very strong condition, since is of the order for banded precision matrices, where is the matrix degree, i.e. the largest number of nonzero entries in the columns.
Some recent work suggests a weaker sufficient condition in terms of the matrix degree. Yuan (2010) estimated each column of the inverse matrix by the Dantzig selector and then sought a symmetric matrix close to these column estimates. When the norm of the precision matrix is bounded, this method achieves a convergence rate of order in several matrix norms. The CLIME estimator, introduced by Cai, Liu and Luo (2011), has the same order of convergence rate; it uses a plug-in method with the Dantzig selector to estimate each column, followed by a simpler symmetrization step. They also require the boundedness of the norm of the unknown. In Yang and Kolaczyk (2010), the Lasso is applied to estimate the columns of the target matrix under the assumption of equal diagonal, and the estimation error is studied in the Frobenius norm for . This column-by-column idea reduces a graphical model to a regression model. It was first introduced by Meinshausen and Bühlmann (2006) for identifying nonzero variables in a graphical model, and is called neighborhood selection.
In this paper, we propose to apply the scaled Lasso (Sun and Zhang, 2011) column by column to estimate a precision matrix in the high-dimensional setting. Based on the connection of the precision matrix to linear regression through the block inversion formula, we construct a column estimator with the scaled Lasso, a joint estimator of the regression coefficients and the noise level. Since our procedure requires only the sample covariance matrix, the estimator extends to generate an approximate inverse of a nonnegative-definite data matrix in a general setting. The scaled Lasso algorithm thus provides a fully specified map from the space of nonnegative-definite matrices to the space of symmetric matrices. For each column, the penalty level of the scaled Lasso is determined by the data via convex minimization, without using cross-validation.
We study theoretical properties of the proposed estimator for a precision matrix under a normality assumption. More precisely, we assume that the data matrix is the sample covariance matrix , where the rows of are iid . Under conditions on the spectrum norm and degree of the inverse of , we prove that the proposed estimator guarantees the rate of convergence of order in the spectrum norm. The conditions are weaker than those in the existing analyses of other algorithms, which typically require the boundedness of the norm. When the norm of the target matrix diverges to infinity, the analysis of the proposed estimator guarantees a faster convergence rate than that of the existing literature. We state this main result of the paper in the following theorem.
Theorem 1
Let be the scaled Lasso estimator, defined in (5), (6) and (7) below, based on iid observations from . Let and
be the smallest and largest eigenvalues of correlation matrix of
, be the inverse of and be the maximum degree of . Suppose that , the diagonal entries of the target matrix are uniformly bounded, is bounded away from 0, and for a small fixed . Then, the spectrum norm of the estimation error is bounded by .

The convergence of the proposed scaled Lasso estimator under the sharper spectrum-norm condition on , instead of the stronger bounded-norm condition, is not entirely technical. It is a direct consequence of the faster convergence rate of the scaled Lasso estimator of the noise level in linear regression. To the best of our knowledge, it is unclear whether other algorithms also achieve this fast rate, either for the estimation of the noise level in linear regression or for the estimation of a precision matrix under the spectrum norm. However, it is still possible that this difference between the scaled Lasso and other methods is due to a potentially coarser specification of the penalty level in other algorithms (e.g. cross-validation) or to less accurate error bounds in other analyses.
The rest of the paper is organized as follows. In Section 2, we present the estimation for the inversion of a nonnegative definite matrix via the scaled Lasso. In Section 3, we study error bounds of the proposed estimator for precision matrix. Simulation studies are presented in Section 4. In Section 5, we discuss oracle inequalities for the scaled Lasso with unnormalized predictors and the estimation of inverse correlation matrix. Section 6 includes all the proofs.
We use the following notation throughout the paper. For a vector
, is the norm, with the special and the usual extensions and . For matrices , is the th column of , represents the submatrix of with rows in and columns in , and is the matrix norm. In particular, is the spectrum norm for symmetric matrices. Moreover, we denote the set by and the set by in subscripts.

2 Matrix inversion via scaled Lasso
Let be a nonnegativedefinite data matrix and be a positivedefinite target matrix with . In this section, we describe the relationship between positivedefinite matrix inversion and linear regression and propose an estimator for via scaled Lasso, a joint convex minimization for the estimation of regression coefficients and noise level.
We use scaled Lasso to estimate column by column. Define and by
(1) 
In the matrix form, we have the following relationship
(2) 
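The displays (1) and (2) were lost in extraction. For reference, the standard block-inversion identities that this construction relies on can be written as follows; this is a reconstruction from the surrounding discussion, with the target inverse denoted \(\Theta = \Sigma^{-1}\), and \(\beta\), \(\sigma_j\) denoting the regression coefficients and noise level of the \(j\)-th column:

```latex
% Regressing the j-th variable on the others: with \Theta = \Sigma^{-1},
\beta_{-j} \;=\; -\,\Theta_{-j,j}\,/\,\Theta_{jj},
\qquad
\sigma_j^2 \;=\; \operatorname{Var}\bigl(X_j - X_{-j}\beta_{-j}\bigr) \;=\; 1/\Theta_{jj},
% and conversely, column j of \Theta is recovered from (\beta_{-j}, \sigma_j):
\Theta_{jj} \;=\; \sigma_j^{-2},
\qquad
\Theta_{-j,j} \;=\; -\,\sigma_j^{-2}\,\beta_{-j}.
```

These identities are what allow the column-by-column regression estimates to be translated back into an estimate of the matrix inverse.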
Let . Since at , one may estimate the th column of by minimizing the penalized quadratic loss. In order to shrink the estimation coefficients on the same scale, we adjust the penalty function with a normalizing factor, which leads to the penalized quadratic loss as follows,
subject to . This is actually the Lasso for a linear regression model with normalized predictors. In practice, we first normalize the predictors by the weights, and then the minimization problem can be solved by standard algorithms for the Lasso. This is similar to Yuan (2010) and Cai, Liu and Luo (2011), who used the Dantzig selector to estimate each column. However, one still needs to choose a penalty level and to estimate in order to recover via (2). Both issues are resolved by the scaled Lasso (Sun and Zhang, 2011):
(3) 
where with a fixed . The scaled Lasso (3) is a solution of joint convex minimization over , due to the convexity in (Huber, 2009; Antoniadis, 2010). Since ,
Thus, (3) is expected to yield consistent estimates of .
Sun and Zhang (2011) provided an iterative algorithm to compute the scaled Lasso estimator (3). We rewrite the algorithm in the form of matrices. For each , the Lasso path is given by the estimates satisfying the following KKT conditions, for all ,
(4) 
where . Based on the Lasso path , the scaled Lasso estimator is computed iteratively by
(5) 
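The display of the iterative algorithm (5) was lost in extraction. The following is a minimal Python sketch of the scaled Lasso iteration as described by Sun and Zhang (2011): alternate a Lasso step at penalty level sigma * lam0 with the noise-level update sigma = ||y - X beta|| / sqrt(n). The coordinate-descent solver `lasso_cd` and all tuning defaults are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent Lasso: minimize ||y - X b||^2 / (2n) + lam ||b||_1.
    Assumes the columns of X are normalized so that X[:, j]' X[:, j] / n = 1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                                  # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r / n + b[j]          # partial correlation for coordinate j
            b_new = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft threshold
            r += X[:, j] * (b[j] - b_new)         # update residual incrementally
            b[j] = b_new
    return b

def scaled_lasso(X, y, lam0, tol=1e-6, max_iter=50):
    """Joint estimate (beta, sigma): alternate a Lasso step at penalty
    lam = sigma * lam0 with the update sigma = ||y - X beta|| / sqrt(n)."""
    n = X.shape[0]
    sigma = np.linalg.norm(y) / np.sqrt(n)        # initial noise level
    for _ in range(max_iter):
        beta = lasso_cd(X, y, sigma * lam0)
        sigma_new = np.linalg.norm(y - X @ beta) / np.sqrt(n)
        if abs(sigma_new - sigma) < tol:
            sigma = sigma_new
            break
        sigma = sigma_new
    return beta, sigma
```

Because the joint criterion (3) is convex, this alternation decreases the objective at every step and converges; no cross-validation is needed once lam0 is fixed.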
Here the penalty level of the Lasso is determined by the data, without using cross-validation. We then take advantage of the relationship (2) and compute the coefficients and noise levels by the scaled Lasso for each column:
(6) 
A good estimator for the target matrix should be symmetric. However, the column-by-column estimator is not symmetric in general. We improve it by a symmetrization step as in Yuan (2010),
(7) 
which can be solved by linear programming. Alternatively, semidefinite programming, which is somewhat more expensive computationally, can be used to produce a nonnegative-definite solution of (7). By construction, the new estimator has the same error rate as the preliminary estimator. A useful property of symmetric matrices is that the spectrum norm is bounded by the matrix norm. The matrix norm can be given more explicitly as the maximum norm of the columns, while the matrix norm is the maximum norm of the rows. Hence, for a symmetric matrix the two norms coincide, and the spectrum norm can be bounded by either of them. Since both our estimator and the target matrix are symmetric, the error bound in the spectrum norm can be studied by bounding the corresponding error, as is typically done in the existing literature. We discuss these error bounds in Section 3.

To sum up, we propose to estimate the matrix inverse by (5), (6) and (7). The iterative algorithm (5) computes the regression coefficients and the noise level based on a Lasso path determined by (4). Then (6) translates the estimates from (5) into column estimates, yielding a preliminary matrix estimator. Finally, the symmetrization step (7) produces a symmetric estimate of the target matrix.
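The whole pipeline above can be sketched in a few lines of Python. This is a hedged illustration, not the paper's implementation: for simplicity it works from the raw data matrix rather than from the sample covariance alone, and it replaces the linear-programming symmetrization (7) by the simpler rule of keeping, for each pair (i, j), the entry of smaller magnitude (the symmetrization used by CLIME); the helper names and tuning defaults are our own.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # coordinate-descent Lasso; assumes X[:, j]' X[:, j] / n = 1 for all j
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r / n + b[j]
            bj = np.sign(rho) * max(abs(rho) - lam, 0.0)
            r += X[:, j] * (b[j] - bj)
            b[j] = bj
    return b

def scaled_lasso(X, y, lam0, max_iter=30):
    # alternate beta <- Lasso(sigma * lam0) and sigma <- ||y - X beta|| / sqrt(n)
    n = X.shape[0]
    sigma = np.linalg.norm(y) / np.sqrt(n)
    for _ in range(max_iter):
        b = lasso_cd(X, y, sigma * lam0)
        sigma = np.linalg.norm(y - X @ b) / np.sqrt(n)
    return b, sigma

def precision_scaled_lasso(X, lam0):
    """Column-by-column precision-matrix estimate from raw data X (n x p),
    followed by a simple min-magnitude symmetrization, a cheap stand-in
    for the linear program in (7)."""
    n, p = X.shape
    scale = np.sqrt((X**2).mean(axis=0))
    Xn = X / scale                          # normalized predictors
    Omega = np.zeros((p, p))
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        b, sigma = scaled_lasso(Xn[:, idx], X[:, j], lam0)
        # block-inversion identities: Theta_jj = sigma^{-2},
        # Theta_{-j,j} = -sigma^{-2} * (original-scale coefficients)
        Omega[j, j] = sigma ** -2
        Omega[idx, j] = -(b / scale[idx]) * sigma ** -2
    return np.where(np.abs(Omega) <= np.abs(Omega.T), Omega, Omega.T)
```

Only one tuning constant, lam0, enters the procedure, matching the claim that no cross-validation is required.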
3 Error bounds for precision matrix
In this section, we study the error for the inverse of a covariance matrix, which is our primary example of the target matrix. From now on, we suppose that the data matrix is the sample covariance matrix , where the rows of are iid , and the target matrix is .
Let and be the smallest and the largest eigenvalues of the correlation matrix . Define and the degree of the matrix
The following theorem gives the convergence rate based on the matrix norm ( matrix norm) and spectrum norm.
Theorem 2
Let and with . Suppose that for a small fixed
. Then with probability greater than
,(8)  
where and are constants depending on only.
Since the entries of are bounded by the maximum of the diagonal, the matrix norm is of the same order as the matrix degree . Thus, the inequality (8) provides a convergence rate of the order for either the matrix norm or the spectrum norm under the conditions , and . The first condition is the main sparsity condition, and the other two are actually conditions on the norm of the target matrix. To achieve the same convergence rate, Yuan (2010) and Cai, Liu and Luo (2011) both imposed the condition and the boundedness of the norm of the unknown. We replace the condition by the weaker boundedness of the spectrum norm of the unknown. The spectrum norm condition on the unknown is not only weaker, but also natural for the convergence in spectrum norm. The extra condition here is not strong. Under the conditions and , the extra condition only requires to be small and it allows to diverge to infinity.
This sharper error bound in the spectrum norm is a consequence of using the scaled Lasso estimator (3). Sun and Zhang (2011) gave a convergence rate of order for the scaled Lasso estimation of the noise levels . With this faster rate of convergence, the estimation error in the diagonal is no longer the main term and thus the condition of the bounded norm of can be weakened.
The consistency of the scaled Lasso estimation of the noise level is based on the error bound for the regression coefficients. Oracle inequalities for the error of the Lasso have been studied under various conditions, including the restricted isometry condition (Candes and Tao, 2007), the compatibility condition (van de Geer, 2007) and the sign-restricted cone invertibility factor (Ye and Zhang, 2010), among others. Sun and Zhang (2011) extended these oracle inequalities to the scaled Lasso. Here we use the version under the condition on the sign-restricted cone invertibility factor (SCIF)
(9) 
with the cone and the sign-restricted cone . It is proved that, conditional on ,
(10) 
under the condition that is bounded away from 0. This is guaranteed by the conditions of Theorem 2. The error bound of matrix norm then follows from (10).
4 Numerical study
In this section, we compare the proposed matrix estimator based on scaled Lasso with graphical Lasso and CLIME (Cai, Liu and Luo, 2011). Three models are considered. The first two models are the same as model 1 and model 2 in Cai, Liu and Luo (2011). Model 2 was also studied in Rothman et al. (2008).

Model 1: .

Model 2: Let , where each off-diagonal entry in is generated independently and equals 0.5 with probability 0.1 or 0 with probability 0.9. is chosen such that the condition number of is . Finally, we rescale the matrix to have unit diagonal.

Model 3: The diagonal of the target matrix has unequal values. , where and is a diagonal matrix with diagonal elements .
For each model, we generate a training sample of size 100 from a multivariate normal distribution with mean zero and covariance matrix
and an independent sample of size 100 from the same distribution for validating the tuning parameter for the graphical Lasso and CLIME. The GLasso and CLIME estimators are computed based on training data with various ’s and we choose by minimizing likelihood loss on the validation sample. The proposed scaled Lasso estimator is computed based on the training sample alone with the penalty level . Consider 6 different dimensions and replicate 100 times for each case. The CLIME estimators for and are not computed due to the computational costs.Table 1 presents the mean and standard deviation of estimation errors based on 100 replications. The estimation error is measured by several matrix norms: spectrum norm, matrix
norm and Frobenius norm. We can see that scaled Lasso estimator, labelled as SLasso, outperforms the graphical Lasso (GLasso) in all cases, while it has a comparable peformance with the CLIME.Model 1  

Spectrum norm  Matrix norm  Frobenius norm  
p  SLasso  GLasso  CLIME  SLasso  GLasso  CLIME  SLasso  GLasso  CLIME  
30  2.41(0.08)  2.49(0.14)  2.29(0.21)  2.93(0.11)  3.09(0.11)  2.92(0.17)  4.09(0.12)  4.24(0.26)  3.80(0.36) 
60  2.61(0.05)  2.94(0.05)  2.68(0.10)  3.10(0.09)  3.55(0.07)  3.27(0.09)  6.16(0.10)  7.15(0.15)  6.32(0.28) 
90  2.67(0.05)  3.07(0.03)  2.87(0.09)  3.19(0.08)  3.72(0.06)  3.42(0.07)  7.73(0.11)  9.25(0.12)  8.42(0.31) 
150  2.74(0.04)  3.19(0.02)  3.05(0.04)  3.28(0.08)  3.88(0.06)  3.55(0.06)  10.22(0.13)  12.55(0.09)  11.68(0.20) 
300  2.80(0.03)  3.29(0.01)  NA  3.38(0.07)  4.06(0.05)  NA  14.77(0.11)  18.44(0.09)  NA 
1000  2.87(0.03)  3.39(0.00)  NA  3.52(0.06)  4.44(0.07)  NA  27.59(0.12)  35.11(0.06)  NA 
Model 2  
Spectrum norm  Matrix norm  Frobenius norm  
p  SLasso  GLasso  CLIME  SLasso  GLasso  CLIME  SLasso  GLasso  CLIME  
30  0.75(0.08)  0.82(0.07)  0.81(0.09)  1.32(0.16)  1.49(0.15)  1.45(0.18)  1.90(0.10)  1.84(0.09)  1.87(0.11) 
60  1.07(0.05)  1.15(0.06)  1.19(0.08)  1.97(0.16)  2.21(0.12)  2.20(0.23)  3.31(0.08)  3.18(0.13)  3.42(0.09) 
90  1.49(0.04)  1.54(0.05)  1.61(0.04)  2.63(0.16)  2.89(0.16)  2.90(0.17)  4.50(0.06)  4.40(0.11)  4.65(0.08) 
150  1.98(0.03)  2.02(0.05)  2.06(0.03)  3.31(0.17)  3.60(0.15)  3.65(0.19)  6.02(0.05)  6.19(0.16)  6.33(0.08) 
300  2.85(0.02)  2.89(0.02)  NA  4.50(0.14)  4.92(0.17)  NA  9.35(0.05)  9.79(0.05)  NA 
1000  5.35(0.01)  5.52(0.01)  NA  7.30(0.21)  7.98(0.15)  NA  18.34(0.05)  20.81(0.02)  NA 
Model 3  
Spectrum norm  Matrix norm  Frobenius norm  
p  SLasso  GLasso  CLIME  SLasso  GLasso  CLIME  SLasso  GLasso  CLIME  
30  1.75(0.10)  2.08(0.10)  1.63(0.19)  2.24(0.14)  2.59(0.10)  2.17(0.20)  2.52(0.10)  2.91(0.16)  2.37(0.25) 
60  2.09(0.08)  2.63(0.04)  2.10(0.10)  2.58(0.13)  3.10(0.05)  2.65(0.14)  3.81(0.09)  4.84(0.08)  3.98(0.13) 
90  2.24(0.07)  2.84(0.03)  2.38(0.18)  2.72(0.12)  3.30(0.06)  2.91(0.12)  4.79(0.08)  6.25(0.08)  5.37(0.37) 
150  2.40(0.06)  3.06(0.02)  2.76(0.05)  2.89(0.11)  3.45(0.04)  3.18(0.09)  6.35(0.09)  8.43(0.07)  7.75(0.08) 
300  2.54(0.05)  3.26(0.01)  NA  3.05(0.10)  3.58(0.03)  NA  9.20(0.09)  12.41(0.04)  NA 
1000  2.68(0.05)  3.47(0.01)  NA  3.26(0.09)  3.73(0.03)  NA  17.2(0.09)  23.55(0.02)  NA 
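The data-generating mechanism of Model 2 can be sketched as follows. Two details were lost in extraction: the condition-number target for delta and which matrix is rescaled to unit diagonal. Following the analogous construction in Rothman et al. (2008) and Cai, Liu and Luo (2011), the sketch below assumes the target condition number is p and rescales the precision matrix; both choices are assumptions, and the function name is ours.

```python
import numpy as np

def model2_precision(p, rng, prob=0.1, val=0.5, cond_target=None):
    """Sparse precision matrix in the style of Model 2: a symmetric B whose
    off-diagonal entries equal `val` with probability `prob`, plus delta * I
    with delta chosen so that B + delta * I has condition number `cond_target`
    (assumed to be p), finally rescaled to unit diagonal."""
    if cond_target is None:
        cond_target = p                       # assumed target condition number
    B = np.zeros((p, p))
    iu = np.triu_indices(p, 1)
    B[iu] = val * (rng.random(iu[0].size) < prob)
    B = B + B.T                               # symmetric, zero diagonal
    ev = np.linalg.eigvalsh(B)
    # solve (ev_max + delta) / (ev_min + delta) = cond_target for delta
    delta = (ev.max() - cond_target * ev.min()) / (cond_target - 1)
    Omega = B + delta * np.eye(p)
    d = np.sqrt(np.diag(Omega))
    Omega = Omega / np.outer(d, d)            # rescale to unit diagonal
    return Omega, np.linalg.inv(Omega)        # (precision, covariance)
```

Samples are then drawn as multivariate normal vectors with the returned covariance, e.g. via a Cholesky factor.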
5 More results
5.1 Oracle inequalities for scaled Lasso estimator
In the proofs of the theoretical results for the proposed estimator, we use oracle inequalities for the estimation error in a linear model whose predictors are not normalized. Here we describe this aspect of our results. Consider a linear model as follows,
Let , , and . In order to penalize the coefficients on the same scale, we use a weighted norm of the coefficients as the penalty function. Consider the estimator
(11) 
This is exactly the scaled Lasso as used in the matrix estimation in Section 2. It is equivalent to estimation based on normalized predictors:
(12) 
with .
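The equivalence between the weighted-penalty problem (11) and the Lasso on normalized predictors (12) can be checked numerically: solving the plain Lasso on the column-rescaled design and mapping the solution back through the same weights yields a vector satisfying the Karush-Kuhn-Tucker conditions of the weighted problem. A minimal sketch, with an illustrative coordinate-descent solver of our own (not the paper's code):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=400):
    # plain Lasso: minimize ||y - X b||^2 / (2n) + lam ||b||_1;
    # assumes X[:, j]' X[:, j] / n = 1 for every column j
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r / n + b[j]
            bj = np.sign(rho) * max(abs(rho) - lam, 0.0)
            r += X[:, j] * (b[j] - bj)
            b[j] = bj
    return b

rng = np.random.default_rng(1)
n, p, lam = 120, 8, 0.1
X = rng.normal(size=(n, p)) * rng.uniform(0.5, 3.0, size=p)  # unequal scales
y = 1.5 * X[:, 0] - X[:, 1] + rng.normal(size=n)
w = np.sqrt((X**2).mean(axis=0))      # weights: column norms / sqrt(n)
b = lasso_cd(X / w, y, lam) / w       # Lasso on normalized X, mapped back
g = X.T @ (y - X @ b) / n             # gradient of the quadratic loss at b
```

At the solution of the weighted problem, the KKT conditions require |g_j| <= lam * w_j for every j, with equality (and matching sign) on the active set; the mapped-back vector b satisfies exactly these conditions.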
The following theorem gives the oracle inequalities for the estimation of regression coefficients and noise level.
Theorem 3
Theorem 3 is an immediate extension of the oracle inequalities for the scaled Lasso in Sun and Zhang (2011). Under the extra condition that the are uniformly bounded away from zero, the estimators have the same convergence rate as for a regression model with normalized predictors. The error bound (10) follows from Theorem 3 and is used to prove the convergence rate of the matrix estimator.
5.2 Error bounds for inverse correlation matrix
Given the data matrix , we are also interested in estimating the inverse correlation matrix
where . When constructing the matrix estimator , we solve the linear regression problem with normalized predictors. If the estimator in (6) is replaced by as in (12), the resulting matrix is an estimator for . Thus, the inverse correlation matrix is estimated by
(15)  
(16) 
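At the population level, the inverse correlation matrix is linked to the inverse covariance by a diagonal congruence: if R = D^{-1/2} Sigma D^{-1/2} with D the diagonal of Sigma, then R^{-1} = D^{1/2} Sigma^{-1} D^{1/2}. This identity underlies replacing the column estimates in (6) by their normalized analogues in (15)-(16). A minimal numerical check of the identity (the matrices below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)          # a positive-definite "covariance"
d = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(d, d)               # the corresponding correlation matrix
# inverse correlation from the inverse covariance:
# R^{-1} = D^{1/2} Sigma^{-1} D^{1/2}
R_inv = np.diag(d) @ np.linalg.inv(Sigma) @ np.diag(d)
```

So an estimator of the inverse covariance, together with the diagonal of the sample covariance, immediately yields an estimator of the inverse correlation, and vice versa.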
Error bounds for this estimator are established as follows.
Theorem 4
Let , with and . Suppose that for a small fixed . Then with probability greater than ,
(17)  
(18) 
where , and are constants depending on only.
Theorem 4 shows that the error bounds are also of the order under proper conditions. The crucial condition here is still the boundedness of the spectrum norm of the unknown matrix.
6 Proofs
In this section, we provide the proofs of Theorem 3, Theorem 2 and Theorem 4. Theorem 1 is a brief version of Theorem 2, so we omit the proof.
Proof of Theorem 3.
Proof of Theorem 2.
Let , , and . By Theorem 4, in the event ,
(19) 
where and .
We first derive probabilistic bounds for some useful quantities. Since and the rows of follow a multivariate normal distribution with covariance matrix , we have and . Thus, follows a distribution with degrees of freedom, and hence
(20) 
Also, we have that follows a distribution with degrees of freedom. We use Lemma 1 in Sun and Zhang (2011) with and to obtain
Thus,
(21) 
i.e. the events occur with probability greater than . Since , we have
(22) 
So there exists a small , such that holds for all with probability greater than .
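The uniform concentration of the diagonal sample variances used in (20)-(22) can be illustrated by a small Monte Carlo: for standardized Gaussian data each diagonal entry of the sample covariance is a chi-squared variable divided by its degrees of freedom, and its maximal deviation from 1 over all p columns is of order sqrt(log p / n). The sample sizes below are illustrative choices of ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 400, 200, 50
worst = np.empty(reps)
for r in range(reps):
    Z = rng.normal(size=(n, p))           # rows iid N(0, I), so sigma_jj = 1
    s = (Z**2).mean(axis=0)               # diagonal sample variances, chi^2_n / n each
    worst[r] = np.abs(s - 1.0).max()      # uniform deviation over all p columns
# the deviations are on the order of sqrt(2 log p / n), about 0.16 here
```

This matches the role of the events in (21)-(22): with high probability, every diagonal sample variance stays within a small multiplicative band around its population value.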
Now we need to bound , for all , with probability greater than under the given conditions, where . Let . We discuss the bounds for within the event .
For with , we define
For any subset , we have
(23) 
By Proposition 2(i) in Zhang and Huang (2008), we have
(24) 
where . We also have
(25)  
(26) 
Let . It follows from the shifting inequality in Ye and Zhang (2010) with that
The second and the third inequalities follow from (23) and (25), respectively. Let in (24) with . Then
Under the condition for a small fixed , is also very small. Thus, with probability greater than , are bounded by for all , where is a constant only depending on .