 # A Coordinate-wise Optimization Algorithm for Sparse Inverse Covariance Selection

Sparse inverse covariance selection is a fundamental problem for analyzing dependencies in high dimensional data. However, such a problem is difficult to solve since it is NP-hard. Existing solutions are primarily based on convex ℓ_1 approximation and iterative hard thresholding, which only lead to sub-optimal solutions. In this work, we propose a coordinate-wise optimization algorithm to solve this problem which is guaranteed to converge to a coordinate-wise minimum point. The algorithm iteratively and greedily selects one variable or swaps two variables to identify the support set, and then solves a reduced convex optimization problem over the support set to achieve the greatest descent. As a side contribution of this paper, we propose a Newton-like algorithm to solve the reduced convex sub-problem. Finally, we demonstrate the efficacy of our method on synthetic data and real-world data sets. As a result, the proposed method achieves state-of-the-art performance in term of accuracy.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In this paper, we mainly focus on the following nonconvex optimization problem:

 minX∈Rn×n f(X)≜⟨Σ,X⟩−logdet(X),s.t. X≻0, ∥X∥0,off≤s, (1)

where is a given symmetric covariance matrix of the input data set, counts the number of non-diagonal and non-zero elements of a square matrix, and is a positive integer that specifies the sparsity level of the solution. means is positive definite. stands for the standard inner product.

The optimization problem in (1) is known as sparse inverse covariance selection in the literature [9, 13]

. It provides a good way of analyzing dependencies in high dimensional data and captures varieties of applications in computer vision and machine learning (e.g. biomedical image analysis

, scene labeling , brain functional network classification 

). The log-determinant function is introduced for maximum likelihood estimation, and the

norm is used to reduce over-fitting and improve the interpretability of the model. We remark that when the sparsity constraint is absent, one can set the gradient of the objective function to zero (i.e. ) and output as the optimal solution.

Problem (1) is very challenging due to the introduction of the combinatorial norm. Existing solutions can be categorized into two classes: convex approximation and iterative hard thresholding. Convex approximation simply replaces the norm by its tightest convex relaxation norm. In the past decades, a plethora of approaches have been proposed to solve the norm approximation problem, which include projected sub-gradient method , (linearized) alternating direction method [26, 34], quadratic approximation method [25, 24, 12, 11], block coordinate descent method [9, 1], Nesterov’s first-order optimal method [18, 19, 6], primal-dual interior point method . Despite the popularity of convex methods, they fail to control the sparsity of the solution and often lead to sub-optimal accuracy for the original non-convex problem. Recent attention has been paid to solving the original non-convex problem directly by the researchers [31, 36, 5].

Iterative hard thresholding method considers iteratively setting the small elements (in magnitude) to zero in a gradient descent manner. By using this strategy, it is able to control the sparsity of the solution directly and exactly. Due to its simplicity, it has been widely used and incorporated into the optimization framework of penalty decomposition algorithm  and mean doubly alternating direction method . In , it is shown that for the general sparse optimization problem, any accumulation point of the sequence generated by the penalty decomposition algorithm always satisfies the first-order optimality condition of the problem.

Recently, A. Beck and Y. Vaisbourd present and analyze a new optimality criterion which is based on coordinate-wise optimality 

. They show that coordinate-wise optimality is strictly stronger than the optimality criterion based on hard thresholding. They apply their algorithm to principal component analysis and show that their method consistently outperforms the well-known truncated power method

. Inspired by this work, we extend their method to solve sparse inverse covariance selection problem. We are also aware of the work  where a cyclic coordinate descent decent algorithm (combined with a randomized initialization strategy) is considered to solve the sparse inverse covariance selection problem. However, their method only addresses the norm regularized optimization problem and it fails to control the sparsity level of the solution.

Contributions: The contributions of this work are three-fold. (i) We propose a new coordinate-wise optimization algorithm for sparse inverse covariance selection. The algorithm iteratively and greedily selects one variable or swap two variables to identify the support set, and then solves a reduced convex optimization problem over the support set (See Section 2). (ii) An efficient Hessian-free Newton-like algorithm to solve the convex subproblem is proposed (Section 3). (iii) We provide some theoretical analysis for the proposed Coordinate-Wise Optimization Algorithm (CWOA) and the Newton-Like Optimization Algorithm (NLOA). We prove that CWOA is guaranteed to converge to a coordinate-wise minimum point of the original nonconvex problem and the NLOA is guaranteed to converge to the global optimal solution of the convex subproblem with global linear rate and local quadratic convergence rate (Section 4). (iv) Extensive experiments have shown that our method consistently outperforms existing solutions in terms of accuracy (Section 5).

Notations:

In this paper, boldfaced lowercase letters denote vectors and uppercase letters denote real-valued matrices. We denote

as the eigenvalues of

in increasing order. All vectors are column vectors and superscript denotes transpose. stacks the columns of the matrix into a column vector and converts into a matrix. Thus, and . We use and to denote the Euclidean inner product and Kronecker product of and , respectively. For any matrix and any , we denote by the element of in row and column and use to denote the position of . Therefore, we have . We denote as a unit vector with a 1 in the entry and 0 in all other entries. We use to denote any position in a square matrix of size where is known from the context, and use and to denote the corresponding row and column for . We denote is a square symmetric matrix with the entries and equal 1 and 0 in all other ones. Note that when , we have . Finally, for any matrices and , we define and .

## 2 Coordinate-wise Optimization Algorithm

This section presents our coordinate-wise optimization algorithm which is guaranteed to converge after a finite amount of iterations to a coordinate-wise minimum point [2, 3]. We denote and as the index of non-diagonally non-zero elements and zero elements of , respectively.

First of all, we notice that when the support set is known, problem (1) reduces to the following convex optimization problem:

 minX≻0 f(X), s.t. XZ=0, with Z≜{1,2,...,n2}∖S. (2)

Our algorithm iteratively and greedily selects one variable or swaps two variables to identify the support set , and then solves a reduced convex sub-problem in (2) to achieve the greatest descent.

We summarize our proposed method in Algorithm 1 and have a few remarks on it below.

Two-stage algorithm. At each iteration of the algorithm, one or two variables of the solution are updated. At the first greedy pursuit stage, the algorithm greedily picks one coordinate that leads to the greatest descent from . This strategy is also known as forward greedy selection in the literature [28, 37]. At the second swap coordinates stage, the algorithm enumerates all the possible pairs with and that leads to the greatest descent and changes the two coordinates from zero/non-zero to non-zero/zero. At both stages, once the support set has been updated, Algorithm 1 runs a convex subproblem procedure to solve (2) over the support set to compute a more ‘compact’ solution.

One-dimensional sub-problem. The problems in (3) and (4) reduce to the following optimization problem:

 minθ f(θ)≜⟨Σ,V+θEj⟩−logdet(V+θEj)s.t. V+θEi≻0 (5)

with for (3) and for (4). We now discuss how to simplify problem (5). We define and obtain the following equations: , where . Noticing that and , we can simplify problem (5) to the following one-dimensional convex problem:

 minθ f(θ)≜2θΣrc−log(1+2Yrcθ−Orcθ2)+C s.t. 1+2Yrcθ−Orcθ2>0,

where is a constant. Noting that is differentiable, we set the gradient of to zero, we obtain . There are two solution to this equation. However, only one of these satisfies the bound constraint . Thus, the optimal solution can be computed as:

 θ∗=⎧⎪ ⎪⎨⎪ ⎪⎩Yrc/Orc,                                if Σrc=0;Yrc/Orc+1/(2Σrc)−√O2rc+4Σ2rcYrrYcc/(2OrcΣrc), if Σrc≠0.

Fast matrix computation. In our algorithm, we assume that is available. This can be achieved by using the follow strategy. We keep a record of in every iteration. Once the solution is changed to , we quickly estimate using the well-known Sherman-Morrison-Woodbury formula 111 . Specifically, we rewrite as and apply the Sherman-Morrison-Woodbury formula with and , leading to the following equation: , where diag is a diagonal matrix with as the main diagonal entries and

is an identity matrix. Finally, we obtain the following results:

, where , , , , and denotes the -th column of .

Remarks: (i) Algorithm 1 can be viewed as an improved version of classical greedy pursuit method for solving the sparsity-constrained inverse covariance selection problem. Given the fact that greedy pursuit methods achieve state-of-the-art performance in varieties of non-convex optimization problems (e.g. compressed sensing , kernel learning , and sensor selection ), our proposed method is expected to achieve state-of-the-art performance as well. (ii) Algorithm 1 is also closely related to forward-backward greedy method in the literature . To obtain the greatest descent, while forward-backward strategy considers the removal step and adding step sequentially, the swapping strategy (refer to the swap coordinates stage in Algorithm 1) considers these two steps simultaneously. Thus, the swapping strategy is generally stronger than the forward-backward strategy.

## 3 Convex Optimization Over Support Set

After the support set has been determined, one need to solve the reduced convex sub-problem as in (2), which is equivalent to the following convex composite minimization problem:

 minX≻0 F(X)≜f(X)+p(X),with p(X)≜IΩ(X), Ω≜{X | XZ=0}, (6)

where and is an indicator function of the convex set with . In what follows, we present an efficient Newton-Like Optimization Algorithm (NLOA) to tackle this problem. This method has the good merits of greedy descent and fast convergence.

Following [29, 35, 16, 32], we develop a quadratic approximation around any solution for the objective function using second-order Taylor expansion:

 q(Θ,X)≜f(X)+⟨Θ,g(X)⟩+12vec(Θ)Th(X)vec(Θ),

where the first-order and second-order derivatives of the objective function can be expressed as :

 g(X)=Σ−X−1, h(X)=X−1⊗X−1.

Then, we keep the non-smooth function and build a quadratic approximation for the smooth function by:

 d(Xt)≜argminΔ q(Δ;Xt)+p(Xt+Δ). (7)

Once the Newton direction has been computed, one can employ an Arimijo-rule based step size selection to ensure positive definiteness and sufficient descent of the next iterate. We summarize our Newton-like algorithm in Algorithm 2. Note that the initial point has to be a feasible solution and the positive definiteness of all the following iterates will be guaranteed by the step size selection procedure (see step 7 in Algorithm 3). For notational convenience, we use the shorthand

 ft=f(Xt), Gt=g(Xt), Ht=h(Xt), Dt=d(Xt)

to denote the objective value, first-order gradient, hessian matrix and the search direction at the point , respectively.

### 3.1 Computing the Search Direction

This subsection focuses on finding the search direction in (7). With the choice of and , (7) boils down to the following optimization problem:

 minΔ ⟨Δ,Gt⟩+12vec(Δ)THtvec(Δ), s.t. ΔZ=0. (8)

It appears that (8) is very difficult to solve. First, it involves computing and storing an Hessian matrix . Second, it is a constrained optimization program with variables and equality constraints.

We carefully analyze (8) and consider the following solutions. For the first issue, one can exploit the Kronecker product structure of the Hessian matrix to avoid storing it. Recall that . Given any vector , using the fact that the Hessian matrix can be computed as , the Hessian-vector product can be computed efficiently as: , which only involves matrix-matrix computation. For the second issue, (8) is, in fact, a unconstrained quadratic program with variables. In order to deal with the variables indexed by , one can explicitly enforce the entries of for current solution and its corresponding gradient to 0. Therefore, the constraint can always be satisfied. Finally, linear conjugate gradient method can be used to solve (8).

We summarize our modified linear conjugate gradient method for computing the search direction in Algorithm 3. The algorithm involves a parameter controlling the maximum number of iterations. Empirically, we found that a value of usually leads to good overall efficiency.

### 3.2 Computing the Step Size

Once the Newton direction is computed, we need to find a step size in order to ensure the positive definiteness of the next iterated result, i.e. , so that a sufficient decrease of the objective function will be resulted. We use Armijo’s rule and try step size with a constant decrease rate until we find the smallest with such that is (i) positive definite, and (ii) satisfies the following sufficient decrease condition :

 f(Xt+αtDt)≤f(Xt)+αtω⟨Gt,Dt⟩,

where . In our experiments, we set and .

We verify positive definiteness of the solution when we compute its Cholesky factorization (taking flops). We note that the Cholesky factorization dominates the computational cost in the step-size computations. To reduce the computation cost, we can reuse the Cholesky factor in the previous iteration when evaluating the objective function (that requires the computation of ) and the gradient (that requires the computation of ).

## 4 Theoretical Analysis

### 4.1 Convergence Analysis of Algorithm 1

We present the convergence results for Algorithm 1, which are analogous to the results in .

###### Proposition 1.

Let be the sequence generated by algorithm 1. Algorithm 1 outputs a coordinate-wise minimum point with for every , where .

###### Proof.

Note that it takes finite iterations for any convex optimization algorithm to produce an optimal solution with a given support set. Combining with the monotonicity of Algorithm 1, we conclude that the sequence of function values are monotonically decreasing and Algorithm 1 stops after a finite number of iterations. We define:

 N0={P | P∈N, ¯S(P)⊆¯S(X∗)}, N1={P | P∈N, ¯S(P)=¯S(X∗)∪{j}}, N2={P | P∈N, ¯S(P)=¯S(X∗)/{i}∪{j}},

for all . Clearly, we have . Now we assume that point is generated by Algorithm 1.

For the case , is a global optimal point generated by the convex optimization subproblem on the given support set. Therefore, for any .

For the case , we notice that Algorithm 1 terminates only if after the swap coordinates stage. For any and , we have the following inequality:

 fi,j=minθ{f(Xk−XkiEi+θEj)}≥f(X∗).

Therefore, we have that , which implicates that we cannot find any swap from support set and non-support set to achieve descent on the objective value. Thus, for any .

For the case , Algorithm 1 must perform greedy pursuit stage before entering the swap coordinates stage. The greedy stage terminates only if for any ,

 fj=minθ{f(Xk+θEj)}≥f(X∗).

It implies that we have selected the element that leads to greatest descent as a new member of non-zero elements when . We conclude that , for any .

Therefore, we finish the proof of this lemma. ∎

### 4.2 Convergence Analysis of Algorithm 2

This subsection provides some convergence analysis for the proposed Newton-like optimization algorithm in Algorithm 2. We denote as the sequence generated by the algorithm and as the global optimal solution set for the convex problem in (2). Throughout this subsection, we make the following assumption.

###### Assumption 1.

The objective function is strongly convex with the modulus and gradient Lipschitz continues with constant for all with .

Remarks: This assumption is mild and equivalent to assuming the solution is bounded since it holds that

 σ≤λ(Ht)≤L⇔σ≤λ((Xt)−1⊗(Xt)−1)≤L ⇔

where as the eigenvalues of in increasing order with .

The following lemma characterizes the optimality of . It is nearly identical to Lemma 1 in . For completeness, we present the proof here.

###### Lemma 1.

It holds that

 ∥Dt∥2Ht+⟨Gt,Dt⟩≤0, ∀Dt with Xt+Dt∈Ω. (9)
###### Proof.

Noticing is the minimizer of (7), we have:

 q(Dt;Xt)+p(Xt+Dt)≤q(Z;Xt)+p(Xt+Z), ∀Z.

Letting where is any constant with , we obtain:

 ⟨Gt,Dt⟩+12∥Dt∥2Ht+p(Xt+Dt) ≤ ⟨Gt,αDt⟩+12∥αDt∥2Ht+p(Xt+αDt) ≤ ⟨Gt,αDt⟩+12∥αDt∥2Ht+αp(Xt+Dt)+(1−α)p(Xt),

where the last inequality uses the convexity of . Rearranging terms yields:

 (1−α)(⟨Gt,Dt⟩+p(Xt+Dt)−p(Xt))≤α2−12∥Dt∥2Ht ⟨Gt,Dt⟩+p(Xt+Dt)−p(Xt)≤−1+α2∥Dt∥2Ht.

Since , we have . Letting , we obtain (9).

###### Theorem 1.

(Global Convergence). We have the following results: (i) There exists a strictly positive constant such that the positive definiteness and sufficient descent conditions (refer to step 7-8 of Algorithm 2) are satisfied. Here denotes a sufficient small positive constant. and are some constants which are independent of the current solution . (ii) The sequence is non-increasing and any cluster point of the sequence is the global optimal solution of (6).

###### Proof.

(i) First, we focus on the positive definiteness condition. We now bound . By Lemma 1, we have :

 0 ≥ ⟨Dt,Gt⟩+∥Dt∥2Ht (10) ≥ −λn(Dt)λn(Gt)+σ∥Dt∥2F = −λn(Dt)λn(Σ−(Xt)−1)+σ∥Dt∥2F ≥ −λn(Dt)⋅(λn(Σ)+1/√σ)+σ(λn(Dt))2,

where the second step uses the fact that and the inequality that ; the third step uses the definition of ; the last step uses the inequalities that . Solving the quadratic inequality in (10) gives . Therefore, we have:

 0≺(1/√L−C1αt)I⪯Xt−αtλn(Dt)I⪯Xt+αtDt.

where the first steps uses the fact that ; the second step uses and ; the last step uses .

Second, we focus on the sufficient decrease condition. For any , we have:

 f(Xt+αtDt)−f(Xt) (11) ≤ α⟨Dt,Gt⟩+(αt)2L2∥Dt∥2F ≤ αt(⟨Dt,Gt⟩+αtL2σ∥D∥2Ht) ≤ αt(⟨Dt,Gt⟩−αtL2σ⟨Dt,Gt⟩) = αt⟨Dt,Gt⟩(1−αtL2σ)≤αt⟨Dt,Gt⟩⋅ω,

where the first step uses the -Lipschitz continuity of the gradient of that: ; the second step uses the lower bound of the Hessian matrix that ; the third step uses (9) that ; the last step uses the choice that .

Combining the positive definiteness condition, sufficient decrease condition and the fact that , we finish the proof of the first part of this lemma.

(ii) From (11) and (9), we have:

 ∀t, f(Xt+1)−f(Xt) ≤ αtω⟨Dt,Gt⟩≤−αtω∥Dt∥2Ht ≤ −σαtω∥Dt∥2F=−ν∥Dt∥2F, with ν≜σαtω>0. (12)

Therefore, the sequence is non-increasing. Summing the inequality in (1) over and using the fact that , we have:

 f(Xt)−f(X0)≤−ν∑t−1i=0∥Di∥2F ⇒ f(X∗)−f(X0)≤−ν∑t−1i=0∥Di∥2F ⇒ (f(X0)−f(X∗))/(tν)≥mini=0,1,...,t−1∥Di∥2F.

As , we have . We further derive the following results: . Based on the fact that , , and , we conclude that is the global optimal solution for the convex optimization problem. Therefore, any cluster point of the sequence is the global optimal solution.

Remarks: Due to the strongly convexity and gradient Lipschitz continuity of the objective function, there always exists a strictly positive step size such that both sufficient decrease condition and positive definite condition can be satisfied for all . This is very crucial for the global convergence of Algorithm 2.

We now prove the global linear convergence rate of Algorithm 2. The following lemma characterizes the relation between and for any .

###### Lemma 2.

If is not the global optimal solution of (6), there exists a constant such that .

###### Proof.

First, we prove that . This can be achieved by contradiction. Assuming that , we obtain: , which implies that is the stationary point. Since (6) is a strongly convex optimization problem, we have , which contradicts with the condition that