 # An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

The estimation of high dimensional precision matrices has been a central topic in statistical learning. However, as the number of parameters scales quadratically with the dimension $p$, many state-of-the-art methods do not scale well to problems with very large $p$. In this paper, we propose a very efficient algorithm for precision matrix estimation via penalized quadratic loss functions. In the high dimension, low sample size setting, the computational complexity of our algorithm is linear in both the sample size and the number of parameters, the same as that of computing the sample covariance matrix.


## 1 Introduction

The precision matrix plays an important role in statistical analysis. Under Gaussian assumptions, the precision matrix is often used to study the conditional independence among the random variables. Contemporary applications usually require fast methods for estimating a very high dimensional precision matrix (Meinshausen and Bühlmann, 2006). Despite recent advances, estimation of the precision matrix remains challenging when the dimension is very large, owing to the fact that the number of parameters to be estimated is of order $p^2$. For example, in the Prostate dataset we study in this paper, 6033 genetic activity measurements are recorded for 102 subjects. The precision matrix to be estimated is of dimension $6033\times 6033$, resulting in more than 18 million parameters.

One of the most well-known and popular methods in precision matrix estimation is the graphical lasso (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008). Without loss of generality, assume that $X_1,\ldots,X_n$ are i.i.d. observations from a $p$-dimensional Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. To estimate the precision matrix $\Omega^*=\Sigma^{-1}$, the graphical lasso solves the following regularized negative log-likelihood:

$$\operatorname*{argmin}_{\Omega\in\mathcal{O}_p}\ \operatorname{tr}(S\Omega)-\log|\Omega|+\lambda\|\Omega\|_1,\tag{1}$$

where $\mathcal{O}_p$ is the set of positive definite matrices in $\mathbb{R}^{p\times p}$, $S$ is the sample covariance matrix, $\|\Omega\|_1$ denotes the sum of the absolute values of the entries of $\Omega$, and $\lambda$ is the tuning parameter. Although (1) is constructed from the Gaussian likelihood, it is known that the graphical lasso also works for non-Gaussian data (Ravikumar et al., 2011). Many algorithms have been developed to solve the graphical lasso. Friedman et al. (2008) proposed a coordinate descent procedure, and Boyd et al. (2011) provided an alternating direction method of multipliers (ADMM) algorithm for solving (1). To obtain faster convergence, second order methods and proximal gradient algorithms on the dual problem have also been well developed; see for example Hsieh et al. (2014), Dalal and Rajaratnam (2017), and the references therein. However, since the graphical lasso involves the matrix determinant, an eigen-decomposition or a determinant computation of a $p\times p$ matrix is inevitable in these algorithms. Note that the computational complexity of an eigen-decomposition or a matrix determinant is of order $p^3$; thus, the computation time of these algorithms scales cubically in $p$.

Recently, Zhang and Zou (2014) and Liu and Luo (2015) proposed to estimate $\Omega^*$ via trace-based quadratic loss functions. Using the Kronecker product and matrix vectorization, our interest is to estimate

$$\operatorname{vec}(\Omega^*)=\operatorname{vec}(\Sigma^{-1})=(\Sigma\otimes I)^{-1}\operatorname{vec}(I),\tag{2}$$

or equivalently,

$$\operatorname{vec}(\Omega^*)=\operatorname{vec}(\Sigma^{-1})=(I\otimes\Sigma)^{-1}\operatorname{vec}(I),\tag{3}$$

where $I$ denotes the identity matrix and $\otimes$ is the Kronecker product. Motivated by (2) and the LASSO (Tibshirani, 1996), a natural way to estimate $\operatorname{vec}(\Omega^*)$ is

$$\operatorname*{argmin}_{\beta\in\mathbb{R}^{p^2}}\ \frac12\beta^{\mathrm T}(S\otimes I)\beta-\beta^{\mathrm T}\operatorname{vec}(I)+\lambda\|\beta\|_1.\tag{4}$$

To obtain a symmetric estimator we can use both (2) and (3), and estimate $\operatorname{vec}(\Omega^*)$ by

$$\operatorname*{argmin}_{\beta\in\mathbb{R}^{p^2}}\ \frac14\beta^{\mathrm T}(S\otimes I+I\otimes S)\beta-\beta^{\mathrm T}\operatorname{vec}(I)+\lambda\|\beta\|_1.\tag{5}$$
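As a quick numerical sanity check of (2) and (3), the sketch below (our own Python/NumPy, not part of the paper or its R package) verifies both identities on a small random covariance matrix; the column-major `reshape(-1, order="F")` plays the role of $\operatorname{vec}(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)          # a well-conditioned covariance matrix
I = np.eye(p)

# vec(Sigma^{-1}), column-major vectorization
omega_vec = np.linalg.inv(Sigma).reshape(-1, order="F")

# Identity (2): (Sigma ⊗ I)^{-1} vec(I), and identity (3): (I ⊗ Sigma)^{-1} vec(I)
lhs2 = np.linalg.solve(np.kron(Sigma, I), I.reshape(-1, order="F"))
lhs3 = np.linalg.solve(np.kron(I, Sigma), I.reshape(-1, order="F"))

print(np.allclose(lhs2, omega_vec), np.allclose(lhs3, omega_vec))  # True True
```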

In matrix notation, writing $\beta=\operatorname{vec}(\Omega)$, (4) can be written as

$$\hat\Omega_1=\operatorname*{argmin}_{\Omega}\ \frac12\operatorname{tr}(\Omega S\Omega^{\mathrm T})-\operatorname{tr}(\Omega)+\lambda\|\Omega\|_1.\tag{6}$$

Symmetrization can then be applied to obtain a final estimator. The loss function

$$L_1(\Omega):=\frac12\operatorname{tr}(\Omega S\Omega^{\mathrm T})-\operatorname{tr}(\Omega)\tag{7}$$

is used in Liu and Luo (2015), where a column-wise estimation approach called SCIO is proposed. Similarly, the matrix form of (5) is

$$\hat\Omega_2=\operatorname*{argmin}_{\Omega}\ \frac14\operatorname{tr}(\Omega^{\mathrm T}S\Omega)+\frac14\operatorname{tr}(\Omega S\Omega^{\mathrm T})-\operatorname{tr}(\Omega)+\lambda\|\Omega\|_1.\tag{8}$$

The loss function

$$L_2(\Omega)=\frac12\{L_1(\Omega)+L_1(\Omega^{\mathrm T})\}=\frac14\operatorname{tr}(\Omega S\Omega^{\mathrm T})+\frac14\operatorname{tr}(\Omega^{\mathrm T}S\Omega)-\operatorname{tr}(\Omega)$$

is equivalent to the D-trace loss proposed by Zhang and Zou (2014), owing to the fact that (8) naturally forces its solution to be symmetric.

In the original papers, Zhang and Zou (2014) and Liu and Luo (2015) established consistency results for the estimators (6) and (8) and showed that their performance is comparable to that of the graphical lasso. As can be seen from the vectorized formulations (2) and (3), the loss functions $L_1$ and $L_2$ are quadratic in $\beta=\operatorname{vec}(\Omega)$. In this note, we show that under such quadratic loss functions, explicit solutions can be obtained in each step of the ADMM algorithm. In particular, we derive explicit formulas for the inverses of $S+\rho I$ and $2^{-1}S\otimes I+2^{-1}I\otimes S+\rho I$ for any given $\rho>0$, from which we are able to solve (6) and (8), or equivalently (4) and (5), with computational complexity of order $np^2$. Such a rate is in some sense optimal, as the complexity of computing the sample covariance matrix $S$ is also of order $np^2$.

## 2 Main Results

For any real matrix $A$, we shall use $\|A\|_2$ to denote its Frobenius norm, and $\|A\|$ to denote its spectral norm, i.e., the square root of the largest eigenvalue of $A^{\mathrm T}A$. For the quadratic loss function $L(\Omega)=L_1(\Omega)$ or $L_2(\Omega)$, the augmented Lagrangian of the ADMM is

$$L_a(\Omega,A,B)=L(\Omega)+\rho/2\,\|\Omega-A+B\|_2^2+\lambda\|A\|_1,$$

where $\rho$ is the step size of the ADMM. By Boyd et al. (2011), the alternating iterations are

$$\begin{aligned}\Omega^{k+1}&=\operatorname*{argmin}_{\Omega}\,L_a(\Omega,A^k,B^k),\\ A^{k+1}&=\operatorname*{argmin}_{A}\,L_a(\Omega^{k+1},A,B^k)=\operatorname{soft}(\Omega^{k+1}+B^k,\lambda/\rho),\\ B^{k+1}&=\Omega^{k+1}-A^{k+1}+B^k,\end{aligned}$$

where $\operatorname{soft}(\cdot,\lambda/\rho)$ is an element-wise soft thresholding operator. Clearly the computational complexity is dominated by the update of $\Omega$, which amounts to solving the following problem:

$$\operatorname*{argmin}_{\Omega\in\mathbb{R}^{p\times p}}\ L(\Omega)+\rho/2\,\|\Omega-A^k+B^k\|_2^2.$$
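The soft thresholding operator used in the $A$-update takes only a few lines; this is a generic NumPy sketch of ours (the paper's own implementation is the R package "EQUAL"):

```python
import numpy as np

def soft(M, t):
    """Element-wise soft thresholding: sign(m) * max(|m| - t, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

# Example: threshold at t = 0.5; entries within [-0.5, 0.5] are set to zero,
# the rest are shrunk toward zero by 0.5.
print(soft(np.array([1.5, -0.2, 0.7, -3.0]), 0.5))
```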

From the convexity of the objective function, the minimizer is characterized by setting the gradient to zero. Consequently, for the estimators (6) and (8), we need to solve the following equations respectively:

$$S\Omega+\rho\Omega=I-\rho(A^k-B^k),\tag{9}$$

$$2^{-1}S\Omega+2^{-1}\Omega S+\rho\Omega=I-\rho(A^k-B^k).\tag{10}$$

By looking at the vectorized formulations (4) and (5), we immediately see that in order to solve (9) and (10) we need to compute the inverses of $S+\rho I$ and $2^{-1}S\otimes I+2^{-1}I\otimes S+\rho I$. The following proposition provides explicit expressions for these inverses.

###### Proposition 1

Write the eigen-decomposition of $S$ as $S=UDU^{\mathrm T}$, where $U\in\mathbb{R}^{p\times r}$ satisfies $U^{\mathrm T}U=I_r$, $D=\operatorname{diag}(d_1,\ldots,d_r)$, and $r=\operatorname{rank}(S)\le\min(n,p)$. For any $\rho>0$, we have

$$(S+\rho I)^{-1}=\rho^{-1}I-\rho^{-1}U\Lambda_1U^{\mathrm T},$$

and

$$\begin{aligned}&(2^{-1}S\otimes I+2^{-1}I\otimes S+\rho I)^{-1}\\ &\quad=\rho^{-1}I-\rho^{-1}(U\tilde\Lambda_1U^{\mathrm T})\otimes I-\rho^{-1}I\otimes(U\tilde\Lambda_1U^{\mathrm T})+\rho^{-1}(U\otimes U)\operatorname{diag}\{\operatorname{vec}(\Lambda_2)\}(U^{\mathrm T}\otimes U^{\mathrm T}),\end{aligned}$$

where $\Lambda_1=D(D+\rho I)^{-1}$, $\tilde\Lambda_1=D(D+2\rho I)^{-1}$, and $\Lambda_2\in\mathbb{R}^{r\times r}$ has entries $(\Lambda_2)_{ij}=d_i/(d_i+2\rho)+d_j/(d_j+2\rho)-(d_i+d_j)/(d_i+d_j+2\rho)$.

The main technique used for deriving the above proposition is the well-known Woodbury matrix identity. The derivation involves lengthy but routine calculations and is hence omitted. Nevertheless, one can verify the proposition directly by checking, for example, that $(S+\rho I)(\rho^{-1}I-\rho^{-1}U\Lambda_1U^{\mathrm T})=I$. By Proposition 1 and the following identities:
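Proposition 1 can also be checked numerically, in the spirit of the direct verification suggested above. The following sketch (our own code; all names are illustrative) builds a rank-$r$ matrix $S=U\operatorname{diag}(d)U^{\mathrm T}$ and compares both closed-form inverses against dense inverses:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r, rho = 6, 3, 0.5
Q, _ = np.linalg.qr(rng.standard_normal((p, r)))   # U: p x r with orthonormal columns
d = rng.uniform(1.0, 2.0, size=r)                  # positive eigenvalues of S
S = (Q * d) @ Q.T                                  # S = U diag(d) U^T, rank r
I = np.eye(p)

# First identity: (S + rho*I)^{-1} = rho^{-1} I - rho^{-1} U Lambda1 U^T
Lam1 = np.diag(d / (d + rho))
inv1 = (I - Q @ Lam1 @ Q.T) / rho
print(np.allclose(inv1, np.linalg.inv(S + rho * I)))

# Second identity, with Lambda1-tilde = D(D + 2 rho I)^{-1} and Lambda2 as stated
a = d / (d + 2 * rho)
Lam2 = (a[:, None] + a[None, :]
        - (d[:, None] + d[None, :]) / (d[:, None] + d[None, :] + 2 * rho))
M = 0.5 * np.kron(S, I) + 0.5 * np.kron(I, S) + rho * np.eye(p * p)
inv2 = (np.eye(p * p)
        - np.kron(Q @ np.diag(a) @ Q.T, I)
        - np.kron(I, Q @ np.diag(a) @ Q.T)
        + (np.kron(Q, Q) * Lam2.reshape(-1, order="F")) @ np.kron(Q, Q).T) / rho
print(np.allclose(inv2, np.linalg.inv(M)))
```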

$$\begin{aligned}\{(U\tilde\Lambda_1U^{\mathrm T})\otimes I\}\operatorname{vec}(C)&=\operatorname{vec}(CU\tilde\Lambda_1U^{\mathrm T}),\\ \{I\otimes(U\tilde\Lambda_1U^{\mathrm T})\}\operatorname{vec}(C)&=\operatorname{vec}(U\tilde\Lambda_1U^{\mathrm T}C),\\ (U\otimes U)\operatorname{diag}\{\operatorname{vec}(\Lambda_2)\}(U^{\mathrm T}\otimes U^{\mathrm T})\operatorname{vec}(C)&=\operatorname{vec}\bigl(U\{\Lambda_2\circ(U^{\mathrm T}CU)\}U^{\mathrm T}\bigr),\end{aligned}$$

where $\circ$ denotes the Hadamard product, we can obtain the following theorem:

###### Theorem 1

For a given matrix $C\in\mathbb{R}^{p\times p}$,

• the solution to the equation $S\Omega+\rho\Omega=C$ is unique and is given by $\Omega=\rho^{-1}(C-U\Lambda_1U^{\mathrm T}C)$;

• the solution to the equation $2^{-1}S\Omega+2^{-1}\Omega S+\rho\Omega=C$ is unique and is given by $\Omega=\rho^{-1}\bigl[C-U\tilde\Lambda_1U^{\mathrm T}C-CU\tilde\Lambda_1U^{\mathrm T}+U\{\Lambda_2\circ(U^{\mathrm T}CU)\}U^{\mathrm T}\bigr]$.

Note that when $S$ is the sample covariance matrix, $U$ and $D$ can be obtained from the singular value decomposition (SVD) of the centered data matrix, whose complexity is of order $\min(n,p)^2\max(n,p)$. On the other hand, the solutions obtained in Theorem 1 involve only elementary operations on $p\times r$ and $p\times p$ matrices, and thus the complexity for solving (9) and (10) can be seen to be of order $np^2$.
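A minimal NumPy sketch of the Theorem 1 updates (our own code, not the authors' R package EQUAL): each solver uses only $p\times r$ and $p\times p$ multiplications and is checked against the defining equations (9) and (10):

```python
import numpy as np

def solve_eq9(Q, d, rho, C):
    """Solve S*Omega + rho*Omega = C, where S = Q diag(d) Q^T, Q: p x r orthonormal."""
    lam1 = d / (d + rho)                               # diagonal of Lambda1
    return (C - Q @ (lam1[:, None] * (Q.T @ C))) / rho

def solve_eq10(Q, d, rho, C):
    """Solve S*Omega/2 + Omega*S/2 + rho*Omega = C."""
    a = d / (d + 2 * rho)                              # diagonal of Lambda1-tilde
    Lam2 = (a[:, None] + a[None, :]
            - (d[:, None] + d[None, :]) / (d[:, None] + d[None, :] + 2 * rho))
    return (C
            - Q @ (a[:, None] * (Q.T @ C))             # U Lam1t U^T C
            - ((C @ Q) * a) @ Q.T                      # C U Lam1t U^T
            + Q @ (Lam2 * (Q.T @ C @ Q)) @ Q.T) / rho  # U {Lam2 ∘ (U^T C U)} U^T

# Check against the defining equations on a toy problem.
rng = np.random.default_rng(2)
p, r, rho = 6, 3, 0.7
Q, _ = np.linalg.qr(rng.standard_normal((p, r)))
d = rng.uniform(0.5, 2.0, size=r)
S = (Q * d) @ Q.T
C = rng.standard_normal((p, p))
O1 = solve_eq9(Q, d, rho, C)
O2 = solve_eq10(Q, d, rho, C)
print(np.allclose(S @ O1 + rho * O1, C))
print(np.allclose(0.5 * S @ O2 + 0.5 * O2 @ S + rho * O2, C))
```

No $p^2\times p^2$ object is ever formed; the Kronecker structure is exploited purely through ordinary matrix products.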

###### Remark 1

Generally, we can specify a different weight for each element and consider the estimators

$$\hat\Omega_k=\operatorname*{argmin}_{\Omega}\ L_k(\Omega)+\lambda\|W\circ\Omega\|_1,\quad k=1,2,$$

where $W$ is a matrix of non-negative weights. For example:

• setting $W_{ij}=1$ for $i\ne j$ and $W_{ii}=0$, so that the diagonal elements are left out of the penalization;

• using the local linear approximation (Zou and Li, 2008), we can set $W_{ij}=p_\lambda'(|\tilde\Omega_{ij}|)/\lambda$, where $\tilde\Omega$ is a LASSO solution and $p_\lambda$ is a general penalty function such as SCAD or MCP.

The ADMM algorithm is the same as in the $\ell_1$-penalized case except that the update of $A$ is replaced by element-wise soft thresholding with different thresholding parameters.
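As an illustration of the second bullet, weights for the SCAD penalty can be computed from its derivative, $p_\lambda'(t)=\lambda\{I(t\le\lambda)+(a\lambda-t)_+/((a-1)\lambda)\,I(t>\lambda)\}$ with the usual default $a=3.7$. The helper below is a sketch of ours, not part of the EQUAL package:

```python
import numpy as np

def scad_weight(t, lam, a=3.7):
    """w = p'_lambda(|t|) / lambda for the SCAD penalty, usual default a = 3.7."""
    t = np.abs(t)
    return np.where(t <= lam,
                    1.0,                                       # LASSO-like region
                    np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam))

# Small entries get full weight 1, large entries get weight 0 (no shrinkage).
print(scad_weight(np.array([0.1, 1.0, 2.0]), lam=0.5))
```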

###### Remark 2

Compared with the ADMM algorithm given in Zhang and Zou (2014), our update of $\Omega$ only involves operations on $p\times r$ and $p\times p$ matrices, while an eigen-decomposition of a $p\times p$ matrix is required in Zhang and Zou (2014); see for example Theorem 1 therein. Consequently, we are able to obtain order $np^2$ for these updates while Zhang and Zou (2014) requires order $p^3$. Our algorithm thus scales much better when $n\ll p$.

Similarly, for the graphical lasso, we can also use ADMM (Boyd et al., 2011) to implement the minimization, where the loss function is $L(\Omega)=\operatorname{tr}(S\Omega)-\log|\Omega|$. The update for $\Omega$ is obtained by solving $\rho\Omega-\Omega^{-1}=\rho(A^k-B^k)-S$. Denoting the eigen-decomposition of $\rho(A^k-B^k)-S$ as $Q^{\mathrm T}\operatorname{diag}(a_1,\ldots,a_p)Q$ with $QQ^{\mathrm T}=I$, we obtain the closed-form solution

$$\Omega^{k+1}=Q^{\mathrm T}\operatorname{diag}\left\{\frac{a_1+\sqrt{a_1^2+4\rho}}{2\rho},\cdots,\frac{a_p+\sqrt{a_p^2+4\rho}}{2\rho}\right\}Q.$$

Compared with the algorithms based on quadratic loss functions, the computational complexity here is dominated by the eigen-decomposition of $p\times p$ matrices, which is of order $p^3$.
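The closed-form $\Omega$-update above is easy to verify directly. The sketch below (ours; the inputs are arbitrary toy quantities) forms $\rho(A^k-B^k)-S$, eigen-decomposes it, and checks the first-order condition $\rho\Omega-\Omega^{-1}=\rho(A^k-B^k)-S$:

```python
import numpy as np

rng = np.random.default_rng(3)
p, rho = 5, 1.0
S = np.cov(rng.standard_normal((20, p)), rowvar=False)   # toy sample covariance
Ak = np.diag(rng.uniform(1.0, 2.0, size=p))              # toy ADMM iterates
Bk = np.zeros((p, p))

M = rho * (Ak - Bk) - S                    # symmetric
a, V = np.linalg.eigh(M)                   # M = V diag(a) V^T
omega = (a + np.sqrt(a ** 2 + 4 * rho)) / (2 * rho)
Omega = (V * omega) @ V.T                  # positive definite: omega > 0 always

# First-order condition of the Omega-update
print(np.allclose(rho * Omega - np.linalg.inv(Omega), M))
```

Each eigenvalue $\omega_i$ solves $\rho\omega_i-1/\omega_i=a_i$, i.e., the quadratic $\rho\omega_i^2-a_i\omega_i-1=0$, and the positive root is taken so that $\Omega^{k+1}$ stays positive definite.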

## 3 Simulations

In this section, we conduct several simulations to illustrate the efficiency and estimation accuracy of our proposed methods. We consider the following two precision matrices:

$$\Omega_1=\bigl(0.5^{|i-j|}\bigr)_{p\times p},\qquad \Omega_2=\Omega_1^{-1}=\frac13\begin{pmatrix}4&-2&&&\\-2&5&-2&&\\&\ddots&\ddots&\ddots&\\&&-2&5&-2\\&&&-2&4\end{pmatrix},$$

where $\Omega_1$ is an asymptotically sparse matrix with many small elements and $\Omega_2$ is a sparse matrix in which each column/row has at most three non-zero elements. For all of our simulations, we fix the sample size $n$ and generate the data from $N(0,\Omega^{-1})$ with $\Omega=\Omega_1$ or $\Omega_2$.
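Both matrices are straightforward to construct; the sketch below (our own code) also verifies numerically that the stated tridiagonal matrix is indeed $\Omega_1^{-1}$, here for $p=8$:

```python
import numpy as np

p = 8
i, j = np.indices((p, p))
Omega1 = 0.5 ** np.abs(i - j)                          # (0.5^{|i-j|})_{p x p}

# Omega2 = Omega1^{-1}: (1/3) * tridiagonal with diagonal (4, 5, ..., 5, 4)
# and off-diagonals -2.
Omega2 = (np.diag([4.0] + [5.0] * (p - 2) + [4.0])
          + np.diag([-2.0] * (p - 1), 1)
          + np.diag([-2.0] * (p - 1), -1)) / 3.0

print(np.allclose(Omega2, np.linalg.inv(Omega1)))
```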

For notational convenience, we shall use the term “EQUAL” to denote our proposed Efficient admm algorithm via the QUAdratic Loss (4), and similarly use “EQUALs” to denote the estimator based on the symmetric quadratic loss (5). An R package “EQUAL” has been developed to implement our methods. For comparison, we consider the following competitors:

• CLIME (Cai et al., 2011) which is implemented by the R package “fastclime”;

• glasso (Friedman et al., 2008) which is implemented by the R package “glasso”;

• BigQuic (Hsieh et al., 2013) which is implemented by the R package “BigQuic”;

• glasso-ADMM which solves the glasso by ADMM (Boyd et al., 2011);

• SCIO (Liu and Luo, 2015) which is implemented by the R package “scio”;

• D-trace (Zhang and Zou, 2014) which is implemented using the ADMM algorithm provided in the paper.

Table 1 summarizes the computation time in seconds based on 10 replications, where all methods are implemented in R on a PC with a 3.3 GHz Intel Core i7 CPU and 16GB memory. For all the methods, we solve a solution path corresponding to 50 lambda values determined by $\lambda_{\max}$, the maximum absolute element of the sample covariance matrix. Although the stopping criterion differs across methods, we can see from Table 1 the computational advantage of our methods. In particular, our proposed algorithms are much faster than the original quadratic-loss-based methods SCIO and D-trace for large $p$. In addition, we can roughly observe that the required time of our proposed algorithms increases quadratically in $p$.

The second simulation is designed to evaluate estimation accuracy. Given the true precision matrix $\Omega$ and an estimator $\hat\Omega$, we report the following four loss functions:

$$\begin{aligned}\text{loss}_1&=\frac{1}{\sqrt p}\|\Omega-\hat\Omega\|_2,\qquad \text{loss}_2=\|\Omega-\hat\Omega\|,\qquad \text{loss}_3=\sqrt{\frac1p\bigl\{\operatorname{tr}(\Omega^{-1}\hat\Omega)-\log|\Omega^{-1}\hat\Omega|-p\bigr\}},\\ \text{loss}_4&=\sqrt{\frac1p\bigl\{\operatorname{tr}(\hat\Omega^{\mathrm T}\Omega^{-1}\hat\Omega)/2-\operatorname{tr}(\hat\Omega)+\operatorname{tr}(\Omega)\bigr\}},\end{aligned}$$

where loss$_1$ is the scaled Frobenius loss, loss$_2$ is the spectral loss, loss$_3$ is Stein’s loss, which is related to the Gaussian likelihood, and loss$_4$ is related to the quadratic loss.

Table 2 reports the simulation results where the tuning parameter is chosen by five-fold cross-validation. We can see that the performance of all three estimators is comparable, indicating that the penalized quadratic loss estimators are also reliable for high dimensional precision matrix estimation. We also observe that the estimator based on (5) has slightly smaller estimation error than the one based on (4). As shown in Table 1, the computation for the quadratic loss estimators is much faster than for the graphical lasso. Moreover, to check the singularity of the estimates, we also report the minimum eigenvalue of each estimate in the final column of Table 2. We observe that all the estimates are positive definite.

Finally, we apply our proposal to the Prostate dataset, which is publicly available at https://web.stanford.edu/~hastie/CASI_files/DATA/prostate.html. The data record 6033 genetic activity measurements for the control group (50 subjects) and the prostate cancer group (52 subjects). Here, the data dimension is $p=6033$ and the sample size is 50 or 52. We estimate the precision matrix for each group. Since EQUAL and EQUALs give similar results, we only report the estimates from EQUALs. It took less than 20 minutes for EQUALs to obtain the solution paths, while glasso could not produce a solution due to running out of memory in R. The sparsity levels of the solution paths are plotted in the upper panel of Figure 1. To compare the precision matrices between the two groups, the network graphs of the EQUALs estimators with tuning parameter $\lambda=0.75$ are provided in the lower panel of Figure 1.

Figure 1: Estimation for the Prostate data using EQUALs. Upper panel: sparsity level (average number of non-zero elements per row/column) versus $\lambda$. Lower panel: network graphs for the two patient groups when $\lambda=0.75$.

## References

• Banerjee et al.  O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9(Mar):485–516, 2008.
• Boyd et al.  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
• Cai et al.  T. Cai, W. Liu, and X. Luo. A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
• Dalal and Rajaratnam  O. Dalal and B. Rajaratnam. Sparse gaussian graphical model estimation via alternating minimization. Biometrika, 104(2):379–395, 2017.
• Friedman et al.  J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
• Hsieh et al.  C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in neural information processing systems, pages 3165–3173, 2013.
• Hsieh et al.  C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. QUIC: quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911–2947, 2014.
• Liu and Luo  W. Liu and X. Luo. Fast and adaptive sparse precision matrix estimation in high dimensions. Journal of Multivariate Analysis, 135:153–162, 2015.
• Meinshausen and Bühlmann  N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.
• Ravikumar et al.  P. Ravikumar, M. J. Wainwright, G. Raskutti, B. Yu, et al. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
• Tibshirani  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, pages 267–288, 1996.
• Yuan and Lin  M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
• Zhang and Zou  T. Zhang and H. Zou. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika, 101(1):103–120, 2014.
• Zou and Li  H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509, 2008.