# Sparse Matrix Inversion with Scaled Lasso

We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ_1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ_1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso.

## Authors

• 4 publications
• 30 publications
04/24/2011

### Scaled Sparse Linear Regression

Scaled sparse linear regression jointly estimates the regression coeffic...
12/11/2020

### Scaling positive random matrices: concentration and asymptotic convergence

It is well known that any positive matrix can be scaled to have prescrib...
03/28/2022

### An efficient GPU-Parallel Coordinate Descent Algorithm for Sparse Precision Matrix Estimation via Scaled Lasso

The sparse precision matrix plays an essential role in the Gaussian grap...
09/11/2019

### Insights and algorithms for the multivariate square-root lasso

We study the multivariate square-root lasso, a method for fitting the mu...
09/01/2020

### Diagonal scalings for the eigenstructure of arbitrary pencils

In this paper we show how to construct diagonal scalings for arbitrary m...
07/01/2016

### A scaled Bregman theorem with applications

Bregman divergences play a central role in the design and analysis of a ...
04/09/2020

### Sparse recovery of noisy data using the Lasso method

We present a detailed analysis of the unconstrained ℓ_1-method Lasso met...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the estimation of the matrix inversion satisfying , given a data matrix . When is a sample covariance matrix, our problem is the estimation of the inverse of the corresponding population covariance matrix. The inverse covariance matrix is also called precision matrix or concentration matrix. With the dramatic advances in technology, the number of covariates is of greater order than the sample size in many statistical and engineering applications. In this case, the sample covariance matrix is always singular and thus it is difficult to compute the precision matrix. In such cases, a certain type of sparsity condition is required for proper estimation the precision matrix and for theoretical investigation of the estimation problem. In this paper, we will impose for simplicity an (maximum degree) sparsity condition on the target inverse matrix .

Many approaches have been proposed to estimate the sparse inverse matrix in the high dimensional setting. The penalization is one of the most popular methods. Lasso-type methods, or convex minimization algorithms with the penalty on all entries of , have been discussed by Banerjee, El Ghaoui and d’Aspremont (2008), Friedman, Hastie and Tibshirani (2008) and more, and by Yuan and Lin (2007) with penalization on the off-diagonal matrix only. This is refereed to as the graphical Lasso (GLasso) due to the connection of the precision matrix to Gaussian Markov graphical models. In this GLasso framework, Rothman, Bickel, Levina and Zhu (2008) proved the convergence rate in the Frobenius norm and in the spectrum norm, where is the number of nonzero entries in the off-diagonal matrix. Ravikumar, Wainwright, Raskutti, and Yu (2008) provided sufficient conditions for model selection consistency of this -regularized MLE. Lam and Fan (2009) studied on a general penalty function and achieved a sharper bound of order under the Frobenius norm for the penalty. Since the spectrum norm can be controlled via the Frobenius norm, this provides a sufficient condition for the convergence under the spectrum norm to the unknown precision matrix. This is a very strong condition since is of the order for banded precision matrices, where is the matrix degree, i.e. the largest number of nonzero entries in the columns.

Some recent work suggests a weaker sufficient condition with the matrix degree. Yuan (2010) estimated each column of the inverse matrix by Dantzig selector and then seek a symmetric matrix close to the column estimation. When norm of the precision matrix is bounded, this method can achieve a convergence rate of order based on several matrix norms. The CLIME estimator, introduced by Cai, Liu and Luo (2011), has the same order of convergence rate, which uses the plug-in method with Dantzig selector to estimate each column, but followed by a simpler symmetrization step. They also require the boundedness of the norm of the unknown. In Yang and Kolaczyk (2010), the Lasso is applied to estimate the columns of the target matrix under the assumption of equal diagonal, and the estimation error is studied in the Frobenius norm for . This column-by-column idea reduces a graphical model to a regression model. It was first introduced in Meinshausen and Bülhmann (2006) for identifying nonzero variables in a graphical model, called neighborhood selection.

In this paper, we propose to apply the scaled Lasso (Sun and Zhang, 2011) column-by-column to estimate a precision matrix in the high dimensional setting. Based on the connection of precision matrix to linear regression by the block inversion formula, we construct a column estimator with the scaled Lasso, a joint estimator for the regression coefficients and noise level. Since we only need the sample covariance matrix in our procedure, this estimator could be extended to generate an approximate inverse of a nonnegative data matrix in a general setting. This scaled Lasso algorithm provides a fully specified map from the space of nonnegative-definite matrices to the space of symmetric matrices. For each column, the penalty level of the scaled Lasso is determined by data via convex minimization, without using cross-validation.

We study theoretical properties of the proposed estimator for a precision matrix under a normality assumption. More precisely, we assume that the data matrix is the sample covariance matrix , where the rows of are iid . Under conditions on the spectrum norm and degree of the inverse of , we prove that the proposed estimator guarantees the rate of convergence of order in the spectrum norm. The conditions are weaker than those in the existing analyses of other algorithms, which typically require the boundedness of the norm. When the norm of the target matrix diverges to infinity, the analysis of the proposed estimator guarantees a faster convergence rate than that of the existing literature. We state this main result of the paper in the following theorem.

###### Theorem 1

Let be the scaled Lasso estimator, defined in (5), (6) and (7) below, based on iid observations from . Let and

be the smallest and largest eigenvalues of correlation matrix of

, be the inverse of and be the maximum degree of . Suppose that , the diagonal entries of the target matrix are uniformly bounded, is bounded from 0 and for a small fixed . Then, the spectrum norm of the estimation error is bounded by

 ∥ˆΘ−Θ∗∥2=OP(d√(logp)/n).

The convergence of the proposed scaled Lasso estimator under the sharper spectrum norm condition on , instead of the stronger bounded condition, is not entirely technical. It is a direct consequence of the faster convergence rate of the scaled Lasso estimator of the noise level in linear regression. To the best of our knowledge, it is unclear if other algorithms also achieve this fast convergence rate, either for the estimation of the noise level in linear regression or for the estimation of a precision matrix under the spectrum norm. However, it is still possible that this difference between the scaled Lasso and other methods is due to potentially coarser specification of the penalty level in other algorithms (e.g. cross validation) or a less accurate error bound in other analyses.

The rest of the paper is organized as follows. In Section 2, we present the estimation for the inversion of a nonnegative definite matrix via the scaled Lasso. In Section 3, we study error bounds of the proposed estimator for precision matrix. Simulation studies are presented in Section 4. In Section 5, we discuss oracle inequalities for the scaled Lasso with unnormalized predictors and the estimation of inverse correlation matrix. Section 6 includes all the proofs.

We use the following notation throughout the paper. For a vector

, is the norm with the special and the usual extensions and . For matrices , is the -th column of , represents the submatrix of with rows in and columns in , is the matrix norm. In particular, is the spectrum norm for symmetric matrices. Moreover, we denote the set by and denote the set by in the subscripts.

## 2 Matrix inversion via scaled Lasso

Let be a nonnegative-definite data matrix and be a positive-definite target matrix with . In this section, we describe the relationship between positive-definite matrix inversion and linear regression and propose an estimator for via scaled Lasso, a joint convex minimization for the estimation of regression coefficients and noise level.

We use scaled Lasso to estimate column by column. Define and by

 σ2j=(Θ∗jj)−1,β∗,j=−Θ∗∗,jσ2j=−Θ∗∗,j(Θ∗jj)−1. (1)

In the matrix form, we have the following relationship

 diagΘ∗=diag(σ−2j,j=1,…,p),Θ∗=−β(diagΘ∗). (2)

Let . Since at , one may estimate the -th column of by minimizing the penalized quadratic loss. In order to shrink the estimation coefficients on the same scale, we adjust the penalty function with a normalizing factor, which leads to the penalized quadratic loss as follows,

 b′¯¯¯¯Σb/2+λp∑k=1¯¯¯¯Σ1/2kk|bk|

subject to . This is actually the Lasso for a linear regression model with normalized preditors. In practice, we first normalize the predictors by the weights and then the minimization problem can be solved by algorithms for the Lasso estimation. This is similar to Yuan (2010) and Cai, Liu and Luo (2011) who used the Dantzig selector to estimate each column. However, one still needs to choose a penalty level and to estimate to recover via (2). A solution to resolve these two issues is the scaled Lasso (Sun and Zhang, 2011):

 (3)

where with a fixed . The scaled Lasso (3) is a solution of joint convex minimization over due to the convexity in (Huber, 2009; Antoniatis, 2010). Since ,

 diag(β′Σ∗β)=(diagΘ∗)−1=% diag(σ2j,j=1,…,p).

Thus, (3) is expected to yield consistent estimates of .

Sun and Zhang (2011) provided an iterative algorithm to compute the scaled Lasso estimator (3). We rewrite the algorithm in the form of matrices. For each , the Lasso path is given by the estimates satisfying the following KKT conditions, for all ,

 ⎧⎪⎨⎪⎩¯¯¯¯Σ−1/2kk¯¯¯¯Σk,∗ˆβ∗,j(λ)=−λ\rm sgn(ˆβk,j(λ)),ˆβk,j≠0,¯¯¯¯Σ−1/2kk¯¯¯¯Σk,∗ˆβ∗,j(λ)∈λ[−1,1],ˆβk,j=0, (4)

where . Based on the Lasso path , the scaled Lasso estimator is computed iteratively by

 ˆσ2j←ˆβ′∗,j¯¯¯¯Σˆβ∗,j,λ←ˆσjλ0,ˆβ∗,j←ˆβ∗,j(λ). (5)

Here the penalty level of the Lasso is determined by the data without using cross-validation. We then simply take advantage of the relationship (2) and compute the coefficients and noise levels by the scaled Lasso for each column

 diag˜Θ=diag(ˆσ−2j,j=1,…,p),˜Θ=−ˆβ(diag˜Θ). (6)

It is noticed that a good estimator for should be a symmetric matrix. However, the estimator does not have to be symmetric. We improve this estimator by using a symmetrization step as in Yuan (2010),

 ˆΘ=argminM:MT=M∥M−˜Θ∥1, (7)

which can be solved by linear programming. Alternatively, semidefinite programming, which is somewhat more expensive computationally, can be used to produce a nonnegative definite

in (7). According to the definition, the new estimator has the same error rate as . A nice property for symmetric matrix is that the spectrum norm is bounded by the matrix norm. The matrix norm can be given more explicitly as the maximum norm of the columns, while the matrix norm is the maximum norm of the rows. Hence, for any symmetric matrix, the matrix norm is equivalent to the matrix norm, so the spectrum norm can be bounded by either of them. Since both our estimator and the target matrix are symmetric, the error bound based on the spectrum norm could be studied by bounding the error, as typically done in the existing literature. We will discuss these error bounds in Section 3.

To sum up, we propose to estimate the matrix inversion by (5), (6) and (7). The iterative algorithm (5) computes the regression coefficients and noise level based on a Lasso path determined by (4). Then (6) translates the resulting estimators of (5) to column estimators and thus a preliminary matrix estimator is constructed. Finally, the symmetrization step (7) produces a symmetric estimate for our target matrix.

## 3 Error bounds for precision matrix

In this section, we study the error for the inverse of a covariance matrix, which is our primary example of the target matrix. From now on, we suppose that the data matrix is the sample covariance matrix , where the rows of are iid , and the target matrix is .

Let and be the smallest and the largest eigenvalues of the correlation matrix . Define and the degree of the matrix

 d=deg(Θ∗)=maxj|Sj|+1.

The following theorem gives the convergence rate based on the matrix norm ( matrix norm) and spectrum norm.

###### Theorem 2

Let and with . Suppose that for a small fixed

. Then with probability greater than

,

 ∥ˆΘ−Θ∗∥2 ≤ ∥ˆΘ−Θ∗∥1 = ∥ˆΘ−Θ∗∥∞ (8) ≤ C1λ20d∥Θ∗∥1ρ−1∗+C2λ0dmaxΘ∗kkρ−1∗

where and are constants depending on only.

Since the entries of are bounded by the maximum of the diagonal, the matrix norm is of the same order as the matrix degree . Thus, the inequality (8) provides a convergence rate of the order for either the matrix norm or the spectrum norm under the conditions , and . The first condition is the main sparsity condition, and the other two are actually conditions on the norm of the target matrix. To achieve the same convergence rate, Yuan (2010) and Cai, Liu and Luo (2011) both imposed the condition and the boundedness of the norm of the unknown. We replace the condition by the weaker boundedness of the spectrum norm of the unknown. The spectrum norm condition on the unknown is not only weaker, but also natural for the convergence in spectrum norm. The extra condition here is not strong. Under the conditions and , the extra condition only requires to be small and it allows to diverge to infinity.

This sharper error bound in the spectrum norm is a consequence of using the scaled Lasso estimator (3). Sun and Zhang (2011) gave a convergence rate of order for the scaled Lasso estimation of the noise levels . With this faster rate of convergence, the estimation error in the diagonal is no longer the main term and thus the condition of the bounded norm of can be weakened.

The consistency of the scaled Lasso estimation for the noise level is based on the error bound for the regression coefficients. Oracle inequalities for the error of the Lasso have been studied with various conditions, including the restricted isometry condition (Candes and Tao, 2007), the compatibility condition (van de Geer, 2007) and the sign-restricted cone invertibility factor (Ye and Zhang, 2010) among others. Sun and Zhang (2011) extended these oracle inequalities for the scaled Lasso. Here we use the version under the condition of sign-restricted cone invertibility factor (SCIF)

 SCIF1(ξ,S;Σ)=inf{|S|⋅∥Σu∥∞∥u∥1:u∈C−(ξ,S)}>0, (9)

with the cone and the sign-restricted cone . It is proved that, conditional on ,

 ∣∣ˆσjσj−1∣∣=Op(1)|Sj|λ20,∥ˆβ−j,j−β−j,j∥1/σj=Op(1)|Sj|λ0, (10)

under the condition that is bounded away from 0. This is guaranteed by the conditions of Theorem 2. The error bound of matrix norm then follows from (10).

## 4 Numerical study

In this section, we compare the proposed matrix estimator based on scaled Lasso with graphical Lasso and CLIME (Cai, Liu and Luo, 2011). Three models are considered. The first two models are the same as model 1 and model 2 in Cai, Liu and Luo (2011). Model 2 was also studied in Rothman et al. (2008).

• Model 1: .

• Model 2: Let , where each off-diagonal entry in is generated independently and equals to 0.5 with probability 0.1 or 0 with probability 0.9. is chosen such that the condition number of is . Finally, we rescale the matrix to the unit in diagonal.

• Model 3: The diagonal of the target matrix has unequal values. , where and is a diagonal matrix with diagonal elements .

For each model, we generate a training sample of size 100 from a multivariate normal distribution with mean zero and covariance matrix

and an independent sample of size 100 from the same distribution for validating the tuning parameter for the graphical Lasso and CLIME. The GLasso and CLIME estimators are computed based on training data with various ’s and we choose by minimizing likelihood loss on the validation sample. The proposed scaled Lasso estimator is computed based on the training sample alone with the penalty level . Consider 6 different dimensions and replicate 100 times for each case. The CLIME estimators for and are not computed due to the computational costs.

Table 1 presents the mean and standard deviation of estimation errors based on 100 replications. The estimation error is measured by several matrix norms: spectrum norm, matrix

norm and Frobenius norm. We can see that scaled Lasso estimator, labelled as SLasso, outperforms the graphical Lasso (GLasso) in all cases, while it has a comparable peformance with the CLIME.

## 5 More results

### 5.1 Oracle inequalities for scaled Lasso estimator

In the proof of the theoretical results for the proposed estimator, we use oracle inequalities for the estimation error associated with a linear model without normalizing the predictors. In the discussion section, we describe this aspect of our results. Consider a linear model as follows,

 y=Xβ+ε,ε∼N(0,σ2In).

Let , , and . In order to penalize the coefficients on the same scale, we use a weighted norm of the coefficients as the penalty function. Consider the estimator

 {ˆβ,ˆσ}=argminb,σ{∥y−Xb∥22nσ+σ2+λ0∥D1/2b∥1}. (11)

This is actually the scaled Lasso as we use in matrix estimation in Section 2. It is equivalent to the estimation based on normalized predictors:

 {ˆα,ˆσ}=argmina,σ{∥y−˜Xa∥22nσ+σ2+λ0∥a∥1} (12)

with .

The following theorem gives the oracle inequalities for the estimation of regression coefficients and noise level.

###### Theorem 3

Let be as in (11) and (12), , , and .
(i)In the event ,

 11+τ+≤(ˆσσ∗)2≤11−τ−,∥ˆα−α∥1≤(1+ξ)τ−σ∗2ξλ0(1−τ−)1/2, (13) ∥ˆβ−β∥1≤(1+ξ)τ−σ∗2ξλ0(1−τ−)1/2minkD1/2kk, (14)

where and with constants , and .
(ii)Let . For ,

 P{z∗≤(1−τ−)λ0(ξ−1)/(ξ+1)}≥1−(1+o(1))ϵ/√πlog(p/ϵ).

Theorem 3 is an immediate extension from the oracle inequalities for the scaled Lasso in Sun and Zhang (2011). With an extra condition that are uniformly bounded from zero, the estimators have the same convergence rate as that for a regression model with normalized predictors. The error rates (10) follows from Theorem 3 and are used to prove the convergence rate of matrix estimation.

### 5.2 Error bounds for inverse correlation matrix

Given the data matrix , we are also interested in estimating the inverse correlation matrix

 Ω∗={D−1/2Σ∗D−1/2}−1=D1/2(Σ∗)−1D1/2.

where . When constructing the matrix estimator , we solve the linear regression problem with normalized predictors. If the estimator in (6) is replaced by as in (12), the resulting matrix is an estimator for . Thus, the inverse correlation matrix is estimated by

 ˜Ω = −ˆαdiag(ˆσ−2j¯¯¯¯Σ1/2jj,j=1,…,p), (15) ˆΩ = argminM:MT=M∥M−˜Ω∥1. (16)

The error bounds of this estimator are well established as follows.

###### Theorem 4

Let , with and . Suppose that for a small fixed . Then with probability greater than ,

 ∥ˆΩ−Ω∗∥2 ≤ ∥ˆΩ−Ω∗∥1 = ∥ˆΩ−Ω∗∥∞ (17) ≤ C1λ20d∥Ω∥1∥Ω∥2+C2λ0∥Ω∥1+C3λ0d∥Ω∥2maxΩ1/2jj (18)

where , and are constants depending on only.

Theorem 4 shows that the error bounds are also of the order under proper conditions. The crucial condition here is still the boundedness of the spectrum norm of the unknown matrix.

## 6 Proofs

In this section, we provide the proofs of Theorem 3, Theorem 2 and Theorem 4. Theorem 1 is a brief version of Theorem 2, so we omit the proof.

#### Proof of Theorem 3.

The inequalities (13) are parallel to Theorem 2 in Sun and Zhang (2011). The only difference is that here we use the bound under the condition of the sign-restricted cone invertibility factor (SCIF). Since , (14) follows from the second inequality in (13).

#### Proof of Theorem 2.

Let , , and . By Theorem 4, in the event ,

 11+τ+(j)≤(ˆσjσ∗j)2≤11−τ−(j),∑k≠j¯¯¯¯Σ1/2kk|ˆβk,j−βk,j|≤(1+ξ)τ−(j)σ∗j2ξλ0(1−τ−(j))1/2, (19)

where and .

We first derive some probabilistic bounds for some useful quantities. Since and the rows of follow a multivariate normal distribution with covariance matrix , we have and . Thus, follows a distribution with degrees of freedom and thus

 P{∣∣(σ∗j/σj)2−1∣∣>√(8/n)log(2p/ϵ)}≤ϵ/p. (20)

Also, we have that follows a -distribution with degrees of freedom. We use Lemma 1 in Sun and Zhang (2011) with and to obtain

 P{|z(j),k|>√2log(p2/ϵ)/n} ≤ (1+ϵn−1)(ϵ/p2)/√πlog(p2/ϵ).

Thus,

 P{maxj|z∗(j)|>√2log(p2/ϵ)/n} ≤ ϵ, (21)

i.e. the events occur with probability greater than . Since , we have

 (22)

So there exists a small , such that holds for all with probability greater than .

Now we need to bound , for all , with probability greater than under the given conditions, where . Let . We discuss the bounds for within the event .

For with , we define

 δ±a=δ±a(X)=maxA,u{±(∥X′AXAu/n∥−1)}, θ(2)a,b=θ(2)a,b(X)=maxA,B,u,vv′X′AXBu/n.

For any subset , we have

 θ(2)a,b≥θ(2)a,b(XT),δ±a≥δ±a(XT),θ(2)a,b≤(1+δ+a)1/2(1+δ+b)1/2≤1+δ+a∨b. (23)

By Proposition 2(i) in Zhang and Huang (2008), we have

 P{(1−c)2ρ∗≤1−δ−m(Z)≤1+δ+m(Z)≤(1+c)2ρ∗}≥1−ϵ, (24)

where . We also have

 1+δ+a(˜X)≤max(Σ∗kk/¯¯¯¯Σkk)(1+δ+a(Z))=(1+δ+a(Z))/(1−ζ), (25) 1−δ−a(˜X)≥min(Σ∗kk/¯¯¯¯Σkk)(1−δ−a(Z))=(1−δ−a(Z))/(1+ζ). (26)

Let . It follows from the shifting inequality in Ye and Zhang (2010) with that

 SCIF1(ξ,Sj;˜Σ−j) ≥ 11+ξ(1−δ−kj+ℓ(˜X−j)−ξ√kj4ℓθ(2)4ℓ,kj+ℓ(˜X−j)) ≥ 11+ξ{1−δ−4ℓ(˜X)−ξ√d4ℓ(1+δ+4ℓ(˜X))} ≥ 11+ξ{1−δ−4ℓ(Z)1+ζ−ξ√d4ℓ1+δ+4ℓ(Z)1−ζ}

The second and the third inequalities follow from (23) and (25), respectively. Let in (24) with . Then

 SCIF1(ξ,Sj;˜Σ−j)≥ρ∗1+ξ{(1−c)21+ζ−(1+c)22(1−ζ)}.

Under the condition for a small fixed , is also very small. Thus, with probability greater than , are bounded by for all , where is a constant only depending on .

Now we are ready to bound of the column of by (19), (20), (21), (22) and the uniform bound for . The following inequalities hold with probability greater than :

 ∥˜Θ⋅j−Θ∗⋅j∥1 ≤ |˜Θjj−Θ∗jj|+∥˜Θ−j,j−Θ∗−j,j∥1 ≤ ∥Θ⋅j∥1⋅∣∣(ˆσjσj)−2−1∣∣+(ˆσjσ∗j)−2(σ∗jσj)−1σ−1j∥ˆβ−j,j−β−j,j∥1σ∗j ≤ ∥Θ⋅j∥1C′1λ20|Sj|ρ∗+C′2λ0|Sj|<