 # A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions

We propose an algorithmic framework for convex minimization problems of a composite function with two terms: a self-concordant function and a possibly nonsmooth regularization term. Our method is a new proximal Newton algorithm that features a local quadratic convergence rate. As a specific instance of our framework, we consider the sparse inverse covariance matrix estimation in graph learning problems. Via a careful dual formulation and a novel analytic step-size selection procedure, our approach for graph learning avoids Cholesky decompositions and matrix inversions in its iteration making it attractive for parallel and distributed implementations.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Sparse inverse covariance matrix estimation is a key step in graph learning. To understand the setup, let us consider learning a Gaussian Markov random field (GMRF) of nodes/variables from a dataset , where is a

-dimensional random vector, drawn from the Gaussian distribution

. Let be the inverse covariance (or the precision) matrix for the model. To satisfy the conditional independencies with respect to the GMRF, must have zero in corresponding to the absence of an edge between nodes and (Dempster, 1972).

To learn the underlying graph structure from , one can use the empirical covariance matrix . Unfortunately, this approach is fundamentally ill-posed since the empirical estimates converge to the true covariance at a -rate (Dempster, 1972). Hence, inferring the true graph structure accurately requires an overwhelming number of samples. Unsurprisingly, we usually have fewer samples than the ambient dimension, compounding the difficulty of estimation.

While the possible GMRF structures are exponentially large, the most interesting graphs are rather simple with a sparse set of edges. Provable learning of such graphs can be achieved by -norm regularization in the maximum log-likelihood estimation:

 Θ∗∈argminΘ≻0{−logdet(Θ)+tr(ˆΣΘ)=:f(Θ)+ρ∥vec(Θ)∥1=:g(Θ)}, (1)

where is a parameter to balance the fidelity error and the sparsity of the solution and is the vectorization operator. Here, corresponds to the empirical log-likelihood and is the sparsity-promoting term. Under this setting, the authors in (Ravikumar et al., 2011) prove that is sufficient for correctly estimating the GMRF, where is the graph node-degree. Moreover, the above formulation still makes sense for learning other graph models, such as the Ising model, due to the connection of to the Bregman distance (Banerjee et al., 2008).

Numerical solution methods for solving problem (1) have been widely studied in the literature. For instance, in (Banerjee et al., 2008; Scheinberg & Rish, 2009; Scheinberg et al., 2010; Hsieh et al., 2011; Rolfs et al., 2012; Olsen et al., 2012) the authors proposed first order primal and dual approaches to (1) and used state-of-the-art structural convex optimization techniques such as coordinate descent methods and Lasso-based procedures. Alternatively, the authors in (Hsieh et al., 2011; Olsen et al., 2012) focused on the second order methods and, practically, achieved fast methods with a high accuracy. In (Scheinberg et al., 2010; Yuan, 2012), the authors studied alternating direction methods to solve (1), while the work in (Li & Toh, 2010) is based on interior point-type methods. Algorithmic approaches where more structure is known a priori can be found in (Lu, 2010).

The complexity of the state-of-the-art approaches mentioned above is dominated by the Cholesky decomposition ( in general), which currently creates an important scalability bottleneck. This decomposition appears mandatory since all these approaches employ a guess-and-check step-size selection procedures to ensure the iterates remain in the positive definite (PD) cone and the inversion of a matrix, whose theoretical cost normally scales with the cost of matrix multiplications ( direct, Strassen, and Coppersmith-Winograd). The inversion operation is seemingly mandatory in the optimization of (1) since the calculation of the descent direction requires it, and quadratic cost approximations to also need it. Via Cholesky decompositions, one can first check if the current solution satisfies the PD cone constraint and then recycle the decomposition for inversion for the next iteration.

Contributions: We propose a new proximal-Newton framework for solving the general problem of (1) by only assuming that is self-concordant. Our algorithm consists of two phases. In Phase 1, we apply a damped proximal Newton scheme with a new, analytic step-size selection procedure, and prove that our objective function always decreases at least a certain fixed amount. As a result, we avoid globalization strategies such as backtracking line-search or trust-region procedures in the existing methods. Moreover, our step-size selection is optimal in the sense that it cannot be improved without additional assumptions on the problem structure. In Phase 2, we simply apply the full step proximal-Newton iteration as we get into its provable quadratic convergence region which we can compute explicitly. Moreover, we do not require any additional assumption such as the uniform boundedness of the Hessian as in (Lee et al., 2012).

In the context of graph learning, we discuss a specific instance of our framework, which avoids Cholesky decompositions and matrix inversions altogether. Hence, the per iteration complexity of our approach is dominated by the cost of matrix multiplications. This is because our analytic step-size selection procedure ensures the positive definiteness of the iterates doing away with global strategies such as line-search which demands the objective evaluations (via Cholesky), and we avoid calculating the gradient explicitly, and hence matrix inversion by a careful dual formulation. As a result, our approach is attractive for distributed and parallel implementations.

Paper outline: In Section 2, we first recall some fundamental concepts of convex optimization and self-concordant functions. Then, we describe the basic optimization set up and show the unique solvability of the problem. In Section 3 we outline our algorithmic framework and describe its analytical complexity. We also deal with the solution of the subproblems by applying the new dual approach in this section. Section 4 presents an application of our theory to graph selection problems. Experimental results on real graph learning problems can be found in Section 5.

## 2 Preliminaries

Basic definitions: We reserve lower-case and bold lower-case letters for scalar and vector representation, respectively. Upper-case bold letters denote matrices. Let : be the vectorization operator which maps a matrix to a single column, and : is the inverse mapping of which transforms a vector to a matrix. For a closed convex function , we denote its domain by , .

###### Definition 2.1 (Self-concordant functions (Definition 2.1.1, pp. 12, (Nesterov & Nemirovski, 1994)).

A convex function is standard self-concordant if . Furthermore, a function is self-concordant if, for any , the function is self-concordant for all and .

Let be a strictly convex and self-concordant function. For a given vector , the local norm around with respect to is defined as while the corresponding dual norm is given as . Let be a function defined as and be a function defined as . The functions and are both nonnegative, strictly convex and increasing. Based on (Nesterov, 2004)[Theorems 4.1.7 & 4.1.8], we recall the following estimates:

 ω(∥y−x∥x)+∇h(x)T(y−x)+h(x)≤h(y), (2) h(y)≤h(x)+∇h(x)T(y−x)+ω∗(∥y−x∥x), (3)

where (2) holds for all , and (3) holds for all such that .

Problem statement: In this paper, we consider the following structural convex optimization problem:

 minx∈Rn{F(x)  |  F(x):=f(x)+g(x)}, (4)

where is a convex, self-concordant function and is a proper, lower semicontinuous and possibly nonsmooth convex regularization term. It is easy to certify that problem (1) can be transformed into (4) by using the tranformation :

 \footnotesizef(x):={−logdet(mat(x))+tr(ˆΣmat(x)),mat(x)≻0,+∞otherwise,

and .

Proximity operator: A basic tool to handle nonsmooth convex functions is the proximity operator: let be a proper lower semicontinuous, possibly nonsmooth and convex in . We denote by the subdifferential of at . Let be a self-concordant function and be fixed. We define for . This operator is a nonexpansive mapping, i.e.,

 ∥∥P¯xg(u)−P¯xg(v)∥∥x≤∥u−v∥∗x,  ∀u,v. (5)

Unique solvability of the problem: We generalize the result in (Hsieh et al., 2011) to show that problem (4) is uniquely solvable.

###### Lemma 2.2.

For some , let for . Then the solution of (4) exists and is unique.

The proof of this lemma can be done similarly as Theorem 4.1.11, pp. 187 in (Nesterov, 2004). For completeness, we provide it in the supplementary document.

## 3 Two-phase proximal Newton method

Our algorithmic framework is simply a proximal-Newton method which generates an iterative sequence starting from . The new point is computed as , where is a step size and is the proximal-Newton-type direction as the solution to the subproblem:

 mind{Q(d;xk)+g(xk+d)}. (Q(xk))

Here, is the following quadratic surrogate of the function around :

 Q(d;xk):=f(xk)+∇f(xk)Td+12dT∇2f(xk)d. (6)

We denote the unique solution of . The optimality condition for is written as follows:

 0∈∂g(xk+dk)+∇f(xk)+∇2f(xk)dk. (7)

Fixed-point characterization. For given , if we define then the unique solution of can be computed as

 dk:=(Pxkg∘S)(xk)−xk=−(I−Rg)(xk). (8)

Here, . The next lemma shows that the fixed point of is the unique solution of (4). The proof is straightforward, and is omitted.

###### Lemma 3.1.

Let be a mapping defined by (8). Then is the unique solution of (4) if and only if is the fixed-point of , i.e., .

Lemma 3.1 suggests that we can generate an iterative sequence based on the fixed-point principle. Under certain assumptions, one can ensure that is contractive and the sequence generated by this scheme is convergent. Hence, we characterize this below.

### 3.1 Full-step proximal-Newton scheme

Here, we show that if we start sufficiently close to the solution , then we can compute the next iteration with full-step , i.e.,

 xk+1:=xk+dk, (9)

where is the unique solution to . We call this scheme the full-step proximal Newton (FPN) scheme. For any , let us define

 λk:=∥∥xk+1−xk∥∥xk. (10)

We refer to this quantity as the proximal Newton decrement. The following theorem establishes the local quadratic convergence of the FPN scheme (9).

###### Theorem 3.2.

For a given , let be the point generated by the full-step proximal Newton scheme (9) and be defined by (10). Then, if , it holds that

 λk+1≤(1−4λk+2λ2k)−1λ2k. (11)

Consequently, the sequence generated by the FPN scheme (9) starting from such that , locally converges to the unique solution of (4) at a quadratic rate.

The proof of Theorem 3.2 can be found in the supplementary document.

### 3.2 Damped proximal Newton scheme

We now establish that, with an appropriate choice of the step-size , the iterative sequence generated by the damped proximal Newton scheme

 xk+1:=xk+αkdk (12)

is a decreasing sequence, i.e., whenever , where is fixed. First, we show the following property for the new iteration .

###### Lemma 3.3.

Suppose that is a point generated by (12). Then, we have

 F(xk+1)≤F(xk)−[αkλ2k−ω∗(αkλk)], (13)

provided that .

###### Proof.

Let , where is the unique solution of . It follows from the optimality condition of (7) that there exists such that

 vk=−∇f(xk)−∇2f(xk)(yk−xk). (14)

Since is self-concordant, by (3), for any such that we have

 F(xk+1) ≤F(xk)+∇f(xk)T(xk+1−xk) (15) +ω∗(∥∥xk+1−xk∥∥xk)+g(xk+1)−g(xk).

Since is convex, , by using (14) we have

 g(xk+1)−g(xk) =g((1−αk)xk+αkyk)−g(xk) ≤αk[g(yk)−g(xk)] ≤αvTk(yk−xk) (16) =αkvTkdk −αk∇f(xk)Tdk−αk∥∥dk∥∥2¯xk.

Now, substituting (3.2) into (15) and noting that we obtain the following result

 F(xk+1) =F(xk)−[αkλ2k−ω∗(αkλk)], (17)

which is indeed (13), provided that . ∎

The following theorem provides an explicit formula for the step size .

###### Theorem 3.4.

Let be a new point generated by the scheme (12) and be defined by (10). Then, if we choose then

 F(xk+1)≤F(xk)−ω(λk). (18)

Moreover, the step is optimal.

###### Proof.

By the choice of , we have . By using the estimate (13) we have

 F(xk+1)≤F(xk)−(1+λk)−1λ2k+ω∗((1+λk)−1λk).

Since for any , the last inequality implies (18). Finally, we note that the function is maximized at , showing that is optimal. ∎

Theorem 3.4 shows that the damped proximal Newton scheme generates a new point that decreases of (4) at least at each iteration, whenever .

Similar to the full-step proximal-Newton scheme (9), we can also show the quadratic convergence of the damped proximal-Newton scheme (12). This statement is summarized in the following theorem.

###### Theorem 3.5.

For a given , let be a new point generated by the scheme (12) with . Then, if , it holds that

 λk+1≤2(1−2λk−λ2k)−1λ2k. (19)

Hence, the sequence generated by (12) with starting from such that locally converges to , the unique solution of (4) at a quadratic rate.

The proof of Theorem 3.5 can be found in the supplementary document. Note that the value in Theorem 3.5 is larger than in Theorem 3.2. However, both values are not tight.

### 3.3 The algorithm pseudocode

As proved by Theorems 3.4 and 3.5, we can use the damped proximal-Newton scheme to build the algorithm. Now, we present a two-phase proximal-Newton algorithm. We first select a constant . At each iteration, we compute the new point by using the damped proximal Newton scheme (12) until we get . Then, we switch to the full-step Newton scheme and perform it until the convergence is achieved. These steps are described in Algorithm 1.

Note that the radius of the quadratic convergence region in Algorithm 1 can be fixed at its upper bound . The maximum number of iterations and can also be specified, if necessary.

### 3.4 Iteration-complexity analysis

We analyze the complexity of Algorithm 1 by separating Phase 1 and Phase 2. This analysis is summarized in the following theorem.

###### Theorem 3.6.

The maximum number of iterations required in Phase 1 does not exceed , where is the unique solution of (4). The maximum number of iterations required in Phase 2 to obtain does not exceed , where .

###### Proof.

Since for all in Phase 1, it follows from Theorem 3.4 that . By induction we have . This implies that . Hence, we can fix

 jmax:=⌊F(x0)−F(x∗)ω(σ)⌋+1.

Let . By induction, it follows from Theorem 3.2 that we have . In order to ensure , we require , which leads to . Hence, we can show that . ∎

We note that we do not use as a stopping criterion of Phase 1 of Algorithm 1. In practice, we only need an upper bound of this quantity. If we fix at then and the complexity of Phase 2 becomes .

### 3.5 Dual solution approach of the subproblem

In this subsection we consider a specific instance of : . First, we derive a dual formulation of the convex subproblem . For notational convenience, we let , . Then, the convex subproblem can be written equivalently as

 miny∈Rn{12yTHky+(qk−Hkxk)Ty+ρ∥y∥1}. (20)

By using the min-max principle, we can write (20) as

 max∥u∥∞≤1miny∈Rn{12yTHky+(q−Hkxk)Ty+ρuTy}. (21)

Solving the inner minimization in (21) we obtain:

 min∥u∥∞≤1{12uTH−1ku+~qTku}, (22)

where . Note that the objective function of (22) is strongly convex. One can apply the fast projected gradient methods with linear convergence rate in (Nesterov, 2007; Beck & Teboulle, 2009) for solving this problem.

In order to recover the solution of the primal subproblem , we note that the solution of the parametric minimization problem in (21) is given by . Let be the optimal solution of (22). We can recover the primal proximal-Newton search direction of the subproblem as

 dk=−∇2f(xk)−1[∇f(xk)+ρu∗k]. (23)

To compute the quantity in Algorithm 1, we use (23) such that

 λk=∥∥∇f(xk)+ρu∗k∥∥∗xk. (24)

Note that computing by (24) requires the inverse of the Hessian matrix .

## 4 Application to graph selection

In this section, we customize the theory framework of Algorithm 1 by using only Phase 1 to solve the graph selection problem (1).

Quantification: For clarity, we retain the matrix variable form as presented in (1). We note that is a self-concordant convex function, while is a proper, lower semicontinuous and nonsmooth convex function. Thus, our theory presented above can be applied to (1). Given the current estimate , we have and . Under this setting, the dual subproblem (22) becomes:

 U∗i=argmin∥vec(U)∥∞≤1{12tr((ΘiU)2)+tr(˜QiU)}, (25)

where . Given the dual solution of (25), the primal proximal-Newton search direction (i.e. the solution of ) is computed as

 Δi:=−((ΘiˆΣ−I)Θi+ρΘiU∗iΘi). (26)

The quantity defined in (24) can be computed by

 λi:=(p−2⋅tr(Wi)+tr(W2i))1/2. (27)

where .

The graph learning algorithm: Algorithm 2 summarizes the proposed scheme for graph selection.

Overall, Algorithm 2 does not require any matrix inversion operation. It only needs matrix-vector and matrix-matrix calculations, making the parallelization of the code easier. We note that due to the predefined step-size selection in Algorithm 1 we do not need to do any backtracking line-search step. This advantage can avoid some overhead computation regarding the evaluation of the objective function which is usually expensive in this application.

Arithmetical complexity analysis: Since the analytical complexity is provided in Theorem 3.6, we only analyze the arithmetical complexity of Algorithm 2 here. As we work through the dual problem, the primal solution is dense even if majority of the entries are rather small (e.g., smaller than ).111In our MATLAB code, we made no attempts for sparsification of the primal solution. The overall complexity of the algorithm can be improved via thresholding tricks. Hence, the arithmetical complexity of Algorithm 2 is dominated by the complexity of matrix multiplications.

For instance, the computation of and require basic matrix multiplications. For the computation of , we require two trace operations: in time-complexity and in time-complexity. We note here that, while is a dense matrix, the trace operation requires only the computation of the diagonal elements of . Given , and , requires time-complexity.

To compute (25), we can use the fast projected gradient method (FPGM) (Nesterov, 2007; Beck & Teboulle, 2009) with step size where is the Lipschitz constant of the gradient of the objective function in (25). It is easy to observe that where

is the largest eigenvalue of

. For sparse , we can approximately compute is by using iterative power methods (typically, 10 iterations suffice). The projection onto clips the elements by unity in time. Thus, the time overhead due to acceleration is within .

Given the above, FPGM requires a constant number of iterations , which is independent of the dimension , to achieve an solution accuracy. Overall, the time-complexity for the solution in (25) is , where is the cost of matrix multiplication.

###### Remark 4.1 (Parallel and distributed implementation ability).

In Algorithm 2, the outer loop does not require any Cholesky decomposition or matrix inversion. Suppose that the fast projected gradient method is applied to solve the dual subproblem (25). The main operation needed in the whole algorithm is matrix-matrix multiplication of the form , where and are symmetric positive definite. This operation can naturally be computed in a parallel or distributed manner. For more details, we refer the reader to Chapter 1 in (Bertsekas & Tsitsiklis, 1989).

## 5 Numerical experiments

In this section we test DPNGS (Algorithm 2 in Section 4) and compare it with the state-of-the-art graph selection algorithm QUadratic Inverse Covariance (QUIC) algorithm (Hsieh et al., 2011) on a real world data set. The QUIC algorithm is also a Newton-based method, which in addition exploits the sparsity in solving its primal subproblems. We note that QUIC was implemented in C while our codes in this work are implemented in MATLAB.

Implementation details: We test DPNGS on MATLAB 2011b running on a PC Intel Xeon X5690 at 3.47GHz per core with 94Gb RAM. To solve (25), we use the FPGM scheme as detailed in the supplementary material. We terminate FPGM if either or the number of iterations reaches where and will be specified later. The stopping criterion of the outer loop is and the maximum number of outer iterations is chosen as . We test the following three variants of DPNGS: DPNGS [ and ], DPNGS() [ and ], and DPNGS() [ and ]. The DPNGS(5) and DPNGS(10) variants can be considered as inexact variants of DPNGS.

Real-world data: In our experiments, we use the real biology data preprocessed by (Li & Toh, 2010) to compare the performance of the DPNGS variants above and QUIC (Hsieh et al., 2011) for problems: Lymph (), Estrogen (), Arabidopsis (), Leukemia () and Hereditary (). This dataset can be found at http://ima.umn.edu/~maxxa007/send_SICS/.

Convergence behaviour analysis: First, we verify the convergence behaviour of Algorithm 2 by analyzing the quadratic convergence of the quantity , where is defined by (27). Our analysis is based on the Lymph problem with variables. We note that reveals the weighted norm of the proximal-gradient mapping of the problem. The convergence behaviour is plotted in Figure 1 for three different values of , namely , , and .

Figure 1 shows that whenever the values of gets into the quadratic region, it converges with only a few iterations. As becomes smaller, we need more iterations to get into the quadratic convergence region.

Next, we illustrate the step-size of DPNGS. Figure 2 shows the increasing behaviour of the step size on the same dataset. Since , it converges quickly at the last iterations.

We also compare the objective value decrement of both algorithms in -log-scale in Figure 3. Using the same tolerance level, we reach the objective value after iterations while QUIC needs iterations. Moreover, Figure 3 shows the quadratic convergence of our approach in contrast to QUIC; the latter requires many more iterations to slightly improve the objective. Figure 3: The difference of the objective values of DPNGS and QUIC in y-log-scale

Figure 4 is the histogram of the solution in scale reported by DPNGS and QUIC. Due to the dual solution approach, DPNGS reports an approximate solution with similar sparsity pattern as the one of QUIC. However, our solution has many small numbers instead of zero as in QUIC as revealed in Figure 4. This seems to be the main weakness of the dual approach: it obviates matrix inversions by avoiding the primal problem, which can return solutions with exact zeros thanks to its soft-thresholding prox-operator.

As a result, DPNGS carries around extremely small coefficients (almost of them smaller than ) often preventing it from achieving the same objective level as the numerical experiments on the full data set shows. At the same time, since the approach does not rely on coordinate descent on active sets, it appears much less sensitive to the choice of . This could be an advantage of DPNGS in applications requiring smaller values. If exact sparsity is needed, then a single primal iteration suffices to remove the small coefficients. Figure 4: The histogram of the coefficient absolute values of the solution in log-scale of DPNGS and QUIC (right).

Numerical experiments on the full dataset: We now report the numerical experiments on the biology dataset and compare the methods. We test both algorithms with four different values of , namely , , and . The numerical results are reported in Table 1. Since QUIC exceeds the maximum number of iterations and takes a long time, we do not report the results corresponding to . We note again that at the time of this ICML submission, our implementation is done in MATLAB, while QUIC code is (carefully!) implemented in C. Hence, our timings may improve.

We highlight several interesting results from Table 1. First, QUIC obtains the highest accuracy results in most cases, which we attribute to the “lack of soft thresholding” in our algorithm. As the DPNGS algorithm carries around a score of extremely small numbers (effectively making the solution dense in the numerical sense), its solutions are close to QUIC’s solutions within numerical precision. Moreover, QUIC is extremely efficient when the value is large, since it exploits the sparsity of the putative solutions via coordinate descent. Unsurprisingly, QUIC slows down significantly as is decreased.

DPGNS(5) and DPNGS(10) can obtain near optimal solutions quite rapidly. In particular, DPNGS(10) seems to be the most competitive across the board, often taking a fraction of QUIC’s time to provide a very close solution. Hence, one can expect these schemes to be used for initializing other algorithms. For instance, QUIC can be a good candidate. We observed in all cases that, in the first few iterations, QUIC performs several Cholesky decompositions to stay within the positive definite cone. As the complexity of such operation is large, our step-size selection within QUIC or a DPNGS(10) initialization can be helpful.

## 6 Conclusions

In this paper, we present the new composite self-concordant optimization framework. As a concrete application, we demonstrate that graph learning is possible without any Cholesky decompositions via analytic step-size selection as well as without matrix inversions via a careful dual formulation within our framework. By exploiting the self-concordance in the composite graph learning objective, we provide an optimal step-size for this class of composite minimization with proximal Newton methods. We show that within the dual formulation of the Newton subproblem, we do not need to explicitly calculate the gradient as it appears in a multiplication form with the Hessian. Thanks to the special structure of this multiplication, we avoid matrix inversions in graph learning. Overall, we expect our optimization framework to have more applications in signal processing/machine learning and be amenable to various parallelization techniques, beyond the ones considered in the graph learning problem.

## References

• Banerjee et al. (2008) Banerjee, O., El Ghaoui, L., and d’Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.
• Beck & Teboulle (2009) Beck, A. and Teboulle, M. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
• Bertsekas & Tsitsiklis (1989) Bertsekas, D.P. and Tsitsiklis, J. N. Parallel and distributed computation: Numerical methods. Prentice Hall, 1989.
• Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. University Press, Cambridge, 2004.
• Dempster (1972) Dempster, A. P. Covariance selection. Biometrics, 28:157–175, 1972.
• Hsieh et al. (2011) Hsieh, C. J., Sustik, M.A., Dhillon, I.S., and Ravikumar, P. Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neutral Information Processing Systems (NIPS), 24:1–18, 2011.
• Lee et al. (2012) Lee, J.D., Sun, Y., and Saunders, M.A. Proximal newton-type methods for convex optimization. Tech. Report., pp. 1–25, 2012.
• Li & Toh (2010) Li, L. and Toh, K.C. An inexact interior point method for l 1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3):291–315, 2010.
• Lu (2010) Lu, Z. Adaptive first-order methods for general sparse inverse covariance selection. SIAM Journal on Matrix Analysis and Applications, 31(4):2000–2016, 2010.
• Nesterov (2004) Nesterov, Y. Introductory lectures on convex optimization: a basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
• Nesterov (2007) Nesterov, Y. Gradient methods for minimizing composite objective function. CORE Discussion paper, 76, 2007.
• Nesterov & Nemirovski (1994) Nesterov, Y. and Nemirovski, A. Interior-point Polynomial Algorithms in Convex Programming. Society for Industrial Mathematics, 1994.
• Olsen et al. (2012) Olsen, P.A., Oztoprak, F., Nocedal, J., and Rennie, S.J. Newton-like methods for sparse inverse covariance estimation. Optimization Online, 2012.
• Ravikumar et al. (2011) Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Statist., 5:935–988, 2011.
• Rolfs et al. (2012) Rolfs, B., Rajaratnam, B., Guillot, D., Wong, I., and Maleki, A. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems 25, pp. 1583–1591, 2012.
• Scheinberg & Rish (2009) Scheinberg, K. and Rish, I. Sinco-a greedy coordinate ascent method for sparse inverse covariance selection problem. preprint, 2009.
• Scheinberg et al. (2010) Scheinberg, K., Ma, S., and Goldfarb, D. Sparse inverse covariance selection via alternating linearization methods. arXiv preprint arXiv:1011.0097, 2010.
• Yuan (2012) Yuan, X. Alternating direction method for covariance selection models. Journal of Scientific Computing, 51(2):261–273, 2012.

## Appendix A The proofs of technical statements

### a.1 The proof of Theorem 3.2

###### Proof.

Let , we define

 Pgk:=(∇2f(xk)+∂g)−1, Sk(z):=∇2f(xk)z−∇f(z).

and

 ek≡ek(x):=[∇2f(%xk)−∇2f(x)]dk).

It follows from the optimality condition (7) in the main text that

 0∈∂g(xk+1)+∇f(xk)+∇2f(xk)(xk+1−xk).

This condition can be written equivalently to

 Sk(xk)+ek(xk)∈∇2f(xk)xk+1+∂g(xk+1).

Therefore, the last relation leads to

 xk+1=Pgk(Sk(xk)+ek). (28)

If we define then

 dk=Pgk(Sk(xk)+ek)−xk.

Consequently, we also have

 dk+1=Pgk(Sk(xk+1)+ek+1)−xk+1. (29)

We consider the norm . By using the nonexpansive property of , it follows from (28) and (29) that

 λ1k =∥∥dk+1∥∥xk =∥∥Pgk(Sk(xk+1)+ek+1)−Pgk(Sk(xk)+ek)∥∥xk (5)≤∥∥Sk(xk+1)+ek+1−Sk(xk)−ek∥∥∗xk ≤∥∥∇f(xk+1)−∇f(xk)−∇2f(xk)(xk+1−xk)∥∥∗xk +∥ek+1