# Graphical Lasso and Thresholding: Equivalence and Closed-form Solutions

Graphical Lasso (GL) is a popular method for learning the structure of an undirected graphical model, which is based on an ℓ_1 regularization technique. The first goal of this work is to study the behavior of the optimal solution of GL as a function of its regularization coefficient. We show that if the number of samples is not too small compared to the number of parameters, the sparsity pattern of the optimal solution of GL changes gradually as the regularization coefficient increases from 0 to infinity. The second objective of this paper is to compare the computationally-heavy GL technique with a numerically-cheap heuristic method for learning graphical models that is based on simply thresholding the sample correlation matrix. To this end, two notions of sign-consistent and inverse-consistent matrices are developed, and it is then shown that the thresholding and GL methods are equivalent if: (i) the thresholded sample correlation matrix is both sign-consistent and inverse-consistent, and (ii) the gap between the largest thresholded and the smallest un-thresholded entries of the sample correlation matrix is not too small. By building upon this result, it is proved that the GL method, viewed as a conic optimization problem, has an explicit closed-form solution if the thresholded sample correlation matrix has an acyclic structure. This result is then generalized to arbitrary sparse support graphs, where a formula is found to obtain an approximate solution of GL. The closed-form solution approximately satisfies the KKT conditions for the GL problem and, more importantly, the approximation error decreases exponentially fast with respect to the length of the minimum-length cycle of the sparsity graph. The developed results are demonstrated on synthetic data, electrical circuits, functional MRI data, and traffic flows for transportation networks.


## 1 Introduction

There has been a pressing need for new and efficient computational methods to analyze and learn the characteristics of high-dimensional data with a structured or randomized nature. Real-world data sets are often overwhelmingly complex, and therefore it is important to obtain a simple description of the data that can be processed efficiently. In an effort to address this problem, there has been a great deal of interest in sparsity-promoting techniques for large-scale optimization problems [1, 2, 3]. These techniques have become essential to the tractability of big-data analyses in many applications, including data mining [4, 5, 6, 7, 8], human brain functional connectivity [9], distributed controller design [10, 11], and compressive sensing [12, 13]. Similar approaches have been used to arrive at a parsimonious estimation of high-dimensional data. However, most of the existing statistical learning techniques in data analytics are contingent upon the availability of a sufficient number of samples (compared to the number of parameters), which is difficult to satisfy in many applications [14, 15]. To remedy these issues, special attention has been paid to augmenting such problems with sparsity-inducing penalty functions to obtain sparse and easy-to-analyze solutions.

Graphical lasso (GL) is one of the most commonly used techniques for estimating the inverse covariance matrix [16, 17, 18]. GL is an optimization problem that shrinks the elements of the inverse covariance matrix towards zero compared to the maximum likelihood estimates, using an ℓ_1 regularization. There is a large body of literature suggesting that the solution of GL is a good estimate for the unknown graphical model, under a suitable choice of the regularization parameter [16, 17, 18, 19, 20, 21]. However, GL is computationally expensive for large-scale problems. An alternative, computationally-cheap heuristic method for estimating graphical models is based on thresholding the sample covariance matrix.

In this paper, we develop a mathematical framework to analyze the relationship between the GL and thresholding techniques. The paper [22] offers a set of conditions for the equivalence of these two methods, and argues that these conditions are satisfied when the regularization coefficient is large or, equivalently, when a sparse graph is sought. Although the conditions derived in [22] shed light on the performance of the GL, they depend on the optimal solution of the GL and cannot be verified without solving the problem. It is therefore highly desirable to find conditions for the equivalence of the GL and thresholding that are stated directly in terms of the sample covariance matrix. To this end, two notions of sign-consistent and inverse-consistent matrices are introduced, and their properties are studied for different types of matrices. It is then shown that the GL and thresholding are equivalent if three conditions are satisfied. The first condition requires a certain matrix formed based on the sample covariance matrix to have a positive-definite completion. The second condition requires this matrix to be sign-consistent and inverse-consistent. The third condition requires a separation between the largest thresholded and the smallest un-thresholded entries of the sample covariance matrix. These conditions can be easily verified for acyclic graphs and are expected to hold for sparse graphs. By building upon these results, an explicit closed-form solution is obtained for the GL method in the case where the thresholded sample covariance matrix has an acyclic support graph. Furthermore, this result is generalized to sparse support graphs to derive a closed-form formula that can serve either as an approximate solution of the GL or as the optimal solution of the GL with a perturbed sample covariance matrix.
The approximation error (together with the corresponding perturbation in the sample covariance matrix) is shown to be related to the lengths of the cycles in the graph.

The remainder of this paper is organized as follows. The problem is formulated in Section 2, and the main results are presented in Section 3, followed by numerical examples and case studies in Section 4. Concluding remarks are drawn in Section 5. Most of the technical proofs are provided in the Appendix.

Notations:

Lowercase, bold lowercase and uppercase letters are used for scalars, vectors and matrices, respectively (say x, 𝐱 and X). The symbols ℝ^d, 𝕊^d and 𝕊^d_+ are used to denote the sets of d-dimensional real vectors, d×d symmetric matrices and d×d symmetric positive-semidefinite matrices, respectively. The notations trace(M) and log det(M) refer to the trace and the logarithm of the determinant of a matrix M, respectively. The (i, j) entry of the matrix M is denoted by M_ij. Moreover, I denotes the identity matrix. The notations |x|, ∥M∥_1 and ∥M∥_F denote the absolute value of the scalar x, and the induced norm-1 and Frobenius norm of the matrix M, respectively. The inequalities M ⪰ 0 and M ≻ 0 mean that M is positive-semidefinite and positive-definite, respectively. The symbol sign(·) shows the sign operator. The ceiling function is denoted as ⌈·⌉. The cardinality of a discrete set D is denoted as |D|. Given a matrix M, define

 ∥M∥_1,off = ∑_{i=1}^{d} ∑_{j=1}^{d} |M_ij| − ∑_{i=1}^{d} |M_ii|,  ∥M∥_max = max_{i≠j} |M_ij|.
###### Definition 1.

Given a symmetric matrix M ∈ 𝕊^d, the support graph or sparsity graph of M is defined as a graph with the vertex set V := {1, 2, ..., d} and the edge set E such that (i, j) ∈ E if and only if M_ij ≠ 0, for every two different vertices i, j ∈ V. The support graph of M captures the sparsity pattern of the matrix M and is denoted as supp(M).

###### Definition 2.

Given a graph G, define G^(c) as the complement of G, which is obtained by removing the existing edges of G and drawing an edge between every two vertices of G that were not originally connected.

###### Definition 3.

Given two graphs G_1 and G_2 with the same vertex set, G_1 is called a subgraph of G_2 if the edge set of G_1 is a subset of the edge set of G_2. The notation G_1 ⊆ G_2 is used to denote this inclusion.

Finally, a symmetric matrix M is said to have a positive-definite completion if there exists a positive-definite matrix ~M of the same size such that ~M_ij = M_ij for every (i, j) ∈ supp(M) and ~M_ii = M_ii for every i.

## 2 Problem Formulation

Consider a random vector x = (x_1, x_2, ..., x_d) with a multivariate normal distribution. Let Σ_* ∈ 𝕊^d_+ denote the covariance matrix associated with the vector x. The inverse of the covariance matrix can be used to determine the conditional independence between the random variables x_1, x_2, ..., x_d. In particular, if the (i, j) entry of Σ_*^{−1} is zero for two disparate indices i and j, then x_i and x_j are conditionally independent given the rest of the variables. The graph supp(Σ_*^{−1}) (i.e., the sparsity graph of Σ_*^{−1}) represents a graphical model capturing the conditional independence between the elements of x. Assume that Σ_* is nonsingular and that supp(Σ_*^{−1}) is a sparse graph. Finding this graph is cumbersome in practice because the exact covariance matrix Σ_* is rarely known. More precisely, supp(Σ_*^{−1}) should be constructed from a given sample covariance matrix (constructed from n samples), as opposed to Σ_*. Let Σ denote an arbitrary positive-semidefinite matrix, which is provided as an estimate of Σ_*. Consider the convex optimization problem

 min_{S∈𝕊^d_+} −log det(S) + trace(ΣS). (1)

It is easy to verify that the optimal solution of the above problem is equal to S^opt = Σ^{−1}. However, there are two issues with this solution. First, since the number of samples available in many applications is small or modest compared to the dimension of Σ, the matrix Σ is ill-conditioned or even singular. Under such circumstances, the equation S^opt = Σ^{−1} leads to large or undefined entries for the optimal solution of (1). Second, although Σ_*^{−1} is assumed to be sparse, a small random difference between Σ and Σ_* would make S^opt highly dense. In order to address the aforementioned issues, consider the problem

 min_{S∈𝕊^d_+} −log det(S) + trace(ΣS) + λ∥S∥_1,off, (2)

where λ > 0 is a regularization parameter. This problem is referred to as Graphical Lasso (GL). Intuitively, the term λ∥S∥_1,off in the objective function serves as a surrogate for promoting sparsity among the off-diagonal entries of S, while ensuring that the problem is well-defined even with a singular input Σ. Henceforth, the notation S^opt will be used to denote a solution of the GL instead of the unregularized optimization problem (1).
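As a quick numerical sanity check of the claim that S = Σ^{−1} solves the unregularized problem (1), the following NumPy snippet (our own illustration, not from the paper) compares the objective at the closed-form minimizer against random symmetric perturbations:

```python
import numpy as np

# Sanity check for problem (1): the minimizer of
# -log det(S) + trace(Sigma @ S) over S > 0 is S = Sigma^{-1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)          # a well-conditioned covariance estimate

def objective(S, Sigma):
    """Objective of problem (1): -log det(S) + trace(Sigma S)."""
    return -np.linalg.slogdet(S)[1] + np.trace(Sigma @ S)

S_star = np.linalg.inv(Sigma)
f_star = objective(S_star, Sigma)

# Random positive-definite perturbations never beat the closed-form minimizer.
for _ in range(100):
    P = rng.standard_normal((4, 4)) * 0.05
    S_pert = S_star + (P + P.T) / 2
    if np.all(np.linalg.eigvalsh(S_pert) > 0):
        assert objective(S_pert, Sigma) >= f_star - 1e-12
```

The check also reflects why (1) fails for singular Σ: `np.linalg.inv(Sigma)` has unbounded entries as Σ approaches singularity, which is exactly the motivation for the penalized problem (2).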

Suppose that it is known a priori that the true graph supp(Σ_*^{−1}) has k edges, for some given number k. With no loss of generality, assume that all nonzero off-diagonal entries of Σ have different magnitudes. Two heuristic methods for finding an estimate of supp(Σ_*^{−1}) are as follows:

• Graphical Lasso: We solve the optimization problem (2) repeatedly for different values of λ until a solution with exactly k edges in its support graph is found.

• Thresholding: Without solving any optimization problem, we simply identify the k entries of Σ that have the largest magnitudes among all off-diagonal entries of Σ. We then replace the remaining off-diagonal entries of Σ with zero and denote the thresholded sample covariance matrix as Σ^(k). Note that Σ and Σ^(k) have the same diagonal entries. Finally, we consider the sparsity graph of Σ^(k), namely supp(Σ^(k)), as an estimate for supp(Σ_*^{−1}).
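The thresholding heuristic above can be sketched in a few lines of NumPy (the function name and the example matrix are ours, not the paper's):

```python
import numpy as np

def threshold_covariance(Sigma, k):
    """Keep the k largest-magnitude off-diagonal entries of Sigma (counting
    each symmetric pair once); zero out the rest; leave the diagonal intact."""
    d = Sigma.shape[0]
    iu = np.triu_indices(d, k=1)
    mags = np.abs(Sigma[iu])
    keep = np.argsort(mags)[::-1][:k]          # indices of k largest magnitudes
    T = np.diag(np.diag(Sigma)).astype(float)
    for idx in keep:
        i, j = iu[0][idx], iu[1][idx]
        T[i, j] = T[j, i] = Sigma[i, j]
    return T

Sigma = np.array([[1.0, 0.5, 0.1],
                  [0.5, 1.0, 0.4],
                  [0.1, 0.4, 1.0]])
T = threshold_covariance(Sigma, k=2)   # keeps the 0.5 and 0.4 entries
```

The support graph of `T` (its nonzero off-diagonal pattern) is the thresholding estimate of the graphical model; no optimization is involved.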

###### Definition 4.

It is said that the sparsity structures of Graphical Lasso and thresholding are equivalent if there exists a regularization coefficient λ such that the sparsity graph of the optimal solution S^opt of (2) coincides with the sparsity graph of the thresholded sample covariance matrix.

Recently, we have verified in several simulations that the GL and thresholding are equivalent for electrical circuits and functional MRI data of 20 subjects, provided that the regularization coefficient is chosen appropriately [22]. This implies that a simple thresholding technique can obtain the same sparsity structure as the computationally-heavy GL technique. In this paper, we aim to understand under what conditions the easy-to-find sparsity graph of the thresholded sample covariance matrix is equal to the hard-to-obtain graph supp(S^opt), without having to solve the GL. Furthermore, we will show that the GL problem has a simple closed-form solution that can be derived merely based on the thresholded sample covariance matrix, provided that its underlying graph has an acyclic structure. This result will then be generalized to obtain an approximate solution for the GL in the case where the thresholded sample covariance matrix has an arbitrary sparsity structure. This closed-form solution converges to the exact solution of the GL as the length of the minimum-length cycle in the support graph of the thresholded sample covariance matrix grows. The derived closed-form solution can be used for two purposes: (1) as a surrogate for the exact solution of the computationally heavy GL problem, and (2) as an initial point for common numerical algorithms that solve the GL (see [16, 23]). The above results unveil fundamental properties of the GL in terms of sparsification and computational complexity. Although conic optimization problems almost never benefit from an exact or inexact explicit formula for their solutions and must be solved numerically, the formula obtained in this paper suggests that sparse GL and related graph-based conic optimization problems may fall into the category of problems with closed-form solutions (similar to least squares problems).

## 3 Main Results

In this section, we present the main results of the paper. In order to streamline the presentation, most of the technical proofs are postponed to the Appendix.

### 3.1 Equivalence of GL and Thresholding

In this subsection, we derive sufficient conditions to guarantee that the GL and thresholding methods result in the same sparsity graph. These conditions depend only on λ and Σ, and are expected to hold whenever λ is large enough or a sparse graph is sought.

###### Definition 5.

A matrix M ∈ 𝕊^d is called inverse-consistent if there exists a matrix N with zero diagonal elements such that

 M+N ≻ 0,  supp(N) ⊆ (supp(M))^(c),  supp((M+N)^{−1}) ⊆ supp(M).

The matrix N is called the inverse-consistent complement of M and is denoted as M^(c).

The next lemma sheds light on the definition of inverse-consistency by introducing an important class of matrices that satisfy this property, namely the set of matrices with positive-definite completions.

###### Lemma 1.

Any matrix with a positive-definite completion is inverse-consistent and has a unique inverse-consistent complement.

Proof: Consider the optimization problem

 min_{S∈𝕊^d} trace(MS) − log det(S) (4a)
 subject to S_ij = 0, ∀(i, j) ∈ (supp(M))^(c) (4b)
 S ⪰ 0, (4c)

and its dual

 max_{Π∈𝕊^d} det(M+Π) (5a)
 subject to M+Π ⪰ 0 (5b)
 supp(Π) ⊆ (supp(M))^(c) (5c)
 Π_ii = 0, i = 1, ..., d. (5d)

Note that Π_ij is equal to the Lagrange multiplier for the constraint (4b) corresponding to every (i, j) ∈ (supp(M))^(c), and is zero otherwise. Since the matrix M has a positive-definite completion, the dual problem is strictly feasible. Moreover, the identity matrix I is a feasible solution of (4). Therefore, strong duality holds and the primal solution is attainable. On the other hand, the objective function (4a) is strictly convex, which makes the solution of the primal problem unique. Let S^opt and Π^opt denote the globally optimal solutions of (4) and (5), respectively. It follows from the first-order optimality conditions that

 S^opt = (M+Π^opt)^{−1}.

This implies that

 supp(Π^opt) ⊆ (supp(M))^(c),  supp((M+Π^opt)^{−1}) ⊆ supp(M),  M+Π^opt ≻ 0.

As a result, M is inverse-consistent and Π^opt is its complement. To prove the uniqueness of the inverse-consistent complement of M, let N denote an arbitrary complement of M. It follows from Definition 5 and the first-order optimality conditions that (M+N)^{−1} is a solution of (4). Since S^opt is the unique solution of (4), it can be concluded that N = Π^opt. This implies that M has a unique inverse-consistent complement.

###### Remark 1.

Two observations can be made based on Lemma 1. First, the positive-definiteness of a matrix is sufficient to guarantee that it belongs to the cone of matrices with positive-definite completions. Therefore, positive-definite matrices are inverse-consistent. Second, when it exists, the inverse-consistent complement of a matrix with a positive-definite completion is equal to the difference between its unique maximum-determinant completion and the matrix itself.

###### Definition 6.

An inverse-consistent matrix M is called sign-consistent if, for every (i, j) ∈ supp(M), the (i, j) entries of M and (M + M^(c))^{−1} are nonzero and have opposite signs.

###### Example 1 (An inverse- and sign-consistent matrix).

To illustrate Definitions 5 and 6, consider the matrix

 M = [ 1     0.3   0     0
       0.3   1    −0.4   0
       0    −0.4   1     0.2
       0     0     0.2   1 ].

The graph supp(M) is a path graph with the vertex set V = {1, 2, 3, 4} and the edge set E = {(1, 2), (2, 3), (3, 4)}. To show that M is inverse-consistent, let the matrix M^(c) be chosen as

 M^(c) = [ 0       0      −0.12   −0.024
           0       0       0      −0.08
          −0.12    0       0       0
          −0.024  −0.08    0       0 ].

The inverse matrix (M + M^(c))^{−1} is equal to

 [ 1/0.91      −0.3/0.91                  0                         0
  −0.3/0.91     1 + 0.09/0.91 + 0.16/0.84  0.4/0.84                 0
   0            0.4/0.84                   1 + 0.16/0.84 + 0.04/0.96  −0.2/0.96
   0            0                         −0.2/0.96                  1/0.96 ].

Observe that:

• M and M + M^(c) are both positive-definite.

• The sparsity graphs of M and M^(c) are complements of each other.

• The sparsity graphs of M and (M + M^(c))^{−1} are identical.

• The nonzero off-diagonal entries of M and (M + M^(c))^{−1} have opposite signs.

The above properties imply that M is both inverse-consistent and sign-consistent, and that M^(c) is its complement.
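These claims can be verified numerically; the following NumPy snippet (our own check, not part of the paper) confirms the properties for the matrices of Example 1:

```python
import numpy as np

# Matrices from Example 1: M has a path support graph 1-2-3-4, and the
# complement entries are products of M's entries along the unique path,
# e.g. (M^c)_{13} = 0.3 * (-0.4) = -0.12.
M = np.array([[1.0, 0.3, 0.0, 0.0],
              [0.3, 1.0, -0.4, 0.0],
              [0.0, -0.4, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])
Mc = np.array([[0.0, 0.0, -0.12, -0.024],
               [0.0, 0.0, 0.0, -0.08],
               [-0.12, 0.0, 0.0, 0.0],
               [-0.024, -0.08, 0.0, 0.0]])

K = np.linalg.inv(M + Mc)
off = ~np.eye(4, dtype=bool)

assert np.all(np.linalg.eigvalsh(M + Mc) > 0)            # M + M^(c) > 0
assert np.allclose(K[off & (M == 0)], 0, atol=1e-12)     # supp(K) = supp(M)
mask = off & (M != 0)
assert np.all(np.sign(K[mask]) == -np.sign(M[mask]))     # sign-consistency
```

The second assertion is the defining property of the inverse-consistent complement: filling in the missing entries of M with M^(c) makes the inverse land back on the sparsity pattern of M.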

###### Definition 7.

Given a graph G and a scalar α ∈ (0, 1), define β(G, α) as the maximum of ∥M^(c)∥_max over all matrices M with positive-definite completions and with diagonal entries all equal to 1 such that supp(M) = G and ∥M∥_max ≤ α.

Consider the dual solution Π^opt introduced in the proof of Lemma 1, and note that it is a function of M. Roughly speaking, the function β(G, α) in the above definition provides an upper bound on ∥Π^opt∥_max over all matrices M with positive-definite completions and with diagonal entries equal to 1 such that supp(M) = G and ∥M∥_max ≤ α. As will be shown later, this function serves as a certificate to verify the optimality conditions for the GL.

Since Σ is constructed from a finite number of random samples, the elements of the upper triangular part of Σ (excluding its diagonal elements) are all nonzero and distinct with probability one. Let σ_1, σ_2, ..., σ_{d(d−1)/2} denote the absolute values of those upper-triangular entries such that

 σ_1 > σ_2 > ... > σ_{d(d−1)/2} > 0.
###### Definition 8.

Consider an arbitrary positive regularization parameter λ that does not belong to the discrete set {σ_1, σ_2, ..., σ_{d(d−1)/2}}. Define the index k associated with λ as the integer number satisfying the relation σ_k > λ > σ_{k+1}. If λ is greater than σ_1, then k is set to 0.

Throughout this paper, the index k refers to the number introduced in Definition 8, which depends on λ.

###### Definition 9.

Define the residue of Σ relative to λ as the matrix Σ^res(λ) whose (i, j) entry is equal to Σ_ij − λ × sign(Σ_ij) if i ≠ j and |Σ_ij| > λ, and is equal to 0 otherwise. Furthermore, define the normalized residue of Σ relative to λ as

 ~Σ^res(λ) = D^{−1/2} × Σ^res(λ) × D^{−1/2},

where D is the diagonal matrix with D_ii = Σ_ii for every i = 1, ..., d.

Notice that Σ^res(λ) is in fact the soft-thresholded sample covariance matrix with the threshold λ (restricted to the off-diagonal entries). For notational simplicity, we will use Σ^res and ~Σ^res instead of Σ^res(λ) and ~Σ^res(λ) whenever the dependence on λ is implied by the context. One of the main theorems of this paper is presented below.

###### Theorem 1.

The sparsity structures of the thresholding and GL methods are equivalent if the following conditions are satisfied:

• Condition 1-i: I + ~Σ^res has a positive-definite completion.

• Condition 1-ii: I + ~Σ^res is sign-consistent.

• Condition 1-iii: The relation

 β(supp(Σ^res), ∥~Σ^res∥_max) ≤ min_{i≠j: |Σ_ij| ≤ λ} (λ − |Σ_ij|)/√(Σ_ii Σ_jj)

holds.

A number of observations can be made based on Theorem 1. First, note that, due to Lemma 1, Condition (1-i) guarantees that I + ~Σ^res is inverse-consistent; in fact, this condition holds whenever I + ~Σ^res itself is positive-definite. The positive-definiteness of I + ~Σ^res is guaranteed to hold if the eigenvalues of the normalized residue of Σ relative to λ are greater than −1. Recall that λ ∈ (σ_{k+1}, σ_k) for some integer k and that the magnitudes of the off-diagonal entries of Σ^res are upper bounded by σ_1 − λ. In the case where the number k is significantly smaller than d(d−1)/2, the residue matrix has many zero entries. Hence, the satisfaction of Condition (1-i) is expected for a large class of residue matrices; this will be verified extensively in our case studies on real-world and synthetically generated data sets. Specifically, this condition is automatically satisfied if I + ~Σ^res is diagonally dominant. Conditions (1-ii) and (1-iii) of Theorem 1 are harder to check. These conditions depend on the support graph of the residue matrix and/or on how small its nonzero entries are. The next two lemmas further analyze these conditions to show that they are expected to be satisfied for large λ.

###### Lemma 2.

Given an arbitrary graph G, there is a strictly positive constant number ζ(G) such that

 β(G, α) ≤ ζ(G)α², ∀ α ∈ (0, 1) (7)

and therefore, Condition (1-iii) is reduced to

 ζ(supp(Σ^res)) × max_{k≠l: |Σ_kl| > λ} ((|Σ_kl| − λ)/√(Σ_kk Σ_ll))² ≤ min_{i≠j: |Σ_ij| ≤ λ} (λ − |Σ_ij|)/√(Σ_ii Σ_jj).
###### Lemma 3.

Consider a matrix M with a positive-definite completion and with unit diagonal entries. Define G = supp(M) and α = ∥M∥_max. There exist strictly positive constant numbers α_0(G) and γ(G) such that M is sign-consistent if α ≤ α_0(G) and the absolute value of the off-diagonal nonzero entries of M is lower bounded by γ(G)α². This implies that Condition (1-ii) is satisfied if ∥~Σ^res∥_max ≤ α_0(supp(Σ^res)) and

 γ(supp(Σ^res)) × max_{k≠l: |Σ_kl| > λ} ((|Σ_kl| − λ)/√(Σ_kk Σ_ll))² ≤ min_{i≠j: |Σ_ij| > λ} (|Σ_ij| − λ)/√(Σ_ii Σ_jj). (8)

For simplicity of notation, define Σ_max = max_i Σ_ii and Σ_min = min_i Σ_ii, and let r = Σ_max/Σ_min. Assuming that λ ∈ (σ_{k+1}, σ_k), Conditions (1-ii) and (1-iii) of Theorem 1 are guaranteed to be satisfied if

 ζ(supp(Σ^res)) ≤ (1/r²) × ((λ − σ_{k+1})/Σ_max) / ((σ_1 − λ)/Σ_max)²,  γ(supp(Σ^res)) ≤ (1/r²) × ((σ_k − λ)/Σ_max) / ((σ_1 − λ)/Σ_max)², (9)

which is equivalent to

 max{γ(supp(Σ^res)), ζ(supp(Σ^res))} ≤ (2/r²) × ((σ_k − σ_{k+1})/Σ_max) / ((2σ_1 − σ_k − σ_{k+1})/Σ_max)²

for the choice λ = (σ_k + σ_{k+1})/2. Consider the set

 T = {|Σ_ij| : i = 1, 2, ..., d−1, j = i+1, ..., d}.

This set has d(d−1)/2 elements. When a sparse graph is sought, the cardinality of {σ_1, ..., σ_k}, as a subset of T, is much smaller than the cardinality of T. This implies that the term (σ_1 − λ)/Σ_max is expected to be small, and its square is likely to be much smaller than 1, provided that the elements of T are sufficiently spread. If this squared term is small relative to the normalized gap (σ_k − σ_{k+1})/Σ_max, then (9), and as a result Conditions (1-ii) and (1-iii), would be satisfied. The satisfaction of this condition will be studied for acyclic graphs in the next section.

### 3.2 Closed-form Solution: Acyclic Sparsity Graphs

In the previous subsection, we provided a set of sufficient conditions for the equivalence of the GL and thresholding methods. Although these conditions are merely based on the known parameters of the problem, i.e., the regularization coefficient and the sample covariance matrix, their verification is contingent upon knowing the value of β(supp(Σ^res), ∥~Σ^res∥_max) and whether I + ~Σ^res is sign-consistent and has a positive-definite completion. The objective of this part is to greatly simplify these conditions in the case where the thresholded sample covariance matrix has an acyclic support graph. First, notice that if I + ~Σ^res is positive-definite, it has a trivial positive-definite completion. Furthermore, we will prove that ζ(G) in Lemma 2 is equal to 1 when G is acyclic. This reduces Condition (1-iii) to the simple inequality

 ∥~Σ^res∥²_max ≤ min_{i≠j: |Σ_ij| ≤ λ} (λ − |Σ_ij|)/√(Σ_ii Σ_jj),

which can be verified efficiently and is expected to hold in practice (see Section 4). Then, we will show that the sign-consistency of I + ~Σ^res is automatically implied by the fact that it has a positive-definite completion when supp(Σ^res) is acyclic.

###### Lemma 4.

Given an arbitrary acyclic graph G, the relation

 β(G, α) ≤ α² (10)

holds for every α ∈ (0, 1). Furthermore, (10) holds with equality if G includes a path of length at least 2.

Sketch of the Proof: In what follows, we provide a sketch of the main idea behind the proof of Lemma 4; the detailed analysis can be found in the Appendix. Without loss of generality, one can assume that G is connected; otherwise, the subsequent argument can be made for every connected component of G. Consider a matrix M that satisfies the conditions delineated in Definition 7, i.e., 1) it has a positive-definite completion and hence is inverse-consistent (see Lemma 1), 2) it has unit diagonal entries, 3) the absolute values of its off-diagonal elements are upper bounded by α, and 4) supp(M) = G. The key idea behind the proof of Lemma 4 lies in the fact that, due to the acyclic structure of G, one can explicitly characterize the inverse-consistent complement of M. In particular, it can be shown that the inverse-consistent complement of M has the following explicit formula: for every (i, j) ∉ supp(M), the entry M^(c)_ij is equal to the product of the off-diagonal elements of M corresponding to the edges in the unique path between the nodes i and j in G. This key insight immediately results in the statement of Lemma 4: the length of the path between two non-adjacent nodes i and j is lower bounded by 2 and therefore |M^(c)_ij| ≤ α². Furthermore, it is easy to see that if G includes a path of length at least 2, then M can be chosen such that |M^(c)_ij| = α² for some (i, j).
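The path-product construction in this sketch is easy to test numerically. The snippet below (our own code, assuming a tree support graph with unit diagonal) fills in the missing entries by multiplying M's entries along unique paths, and checks that the inverse of the completed matrix has the support of M:

```python
import numpy as np

def path_product_complement(M, edges, d):
    """For a matrix M whose support graph (given by `edges`) is a tree,
    return N with N[i,j] = product of M's entries along the unique path
    from i to j, for every non-adjacent pair; zero elsewhere."""
    adj = {v: [] for v in range(d)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    N = np.zeros((d, d))
    for s in range(d):
        # DFS from s, accumulating the product of edge entries on the path.
        stack = [(s, -1, 1.0)]
        while stack:
            v, parent, prod = stack.pop()
            if v != s and (s, v) not in edges and (v, s) not in edges:
                N[s, v] = prod
            for w in adj[v]:
                if w != parent:
                    stack.append((w, v, prod * M[v, w]))
    return N

# Example 1's path graph 1-2-3-4 (0-indexed here).
M = np.array([[1.0, 0.3, 0.0, 0.0],
              [0.3, 1.0, -0.4, 0.0],
              [0.0, -0.4, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])
edges = {(0, 1), (1, 2), (2, 3)}
N = path_product_complement(M, edges, 4)
K = np.linalg.inv(M + N)
# Off-tree entries of K vanish, so N is the inverse-consistent complement.
```

Since every filled entry is a product of at least two factors bounded by α in magnitude, ∥N∥_max ≤ α², which is exactly the bound (10).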

Lemma 4 is at the core of our subsequent arguments. It shows that the function has a simple and explicit formula since its inverse-consistent complement can be easily obtained. Furthermore, it will be used to derive approximate inverse-consistent complement of the matrices with sparse, but not necessarily acyclic support graphs.

###### Lemma 5.

Condition (1-ii) of Theorem 1 is implied by its Condition (1-i) if the graph supp(Σ^res) is acyclic.

Proof: Consider an arbitrary matrix M with a positive-definite completion. It suffices to show that if supp(M) is acyclic, then M is sign-consistent. To this end, consider the matrix Π^opt introduced in the proof of Lemma 1, which is indeed the unique inverse-consistent complement of M. For an arbitrary pair (i, j) ∈ supp(M), define a diagonal matrix Φ as follows:

• Consider the graph obtained from the acyclic graph supp(M) by removing its edge (i, j). The resulting graph is disconnected, because there is no other path between the nodes i and j.

• Divide the disconnected graph into two groups 1 and 2 such that group 1 contains node i and group 2 includes node j.

• For every v = 1, ..., d, define Φ_vv as 1 if node v belongs to group 1, and as −1 otherwise.

In light of Lemma 1, (M + Π^opt)^{−1} is the unique solution of (4). Similarly, Φ(M + Π^opt)^{−1}Φ is a feasible point for (4). As a result, the following inequality must hold:

 {trace(M(M+Π^opt)^{−1}) − log det((M+Π^opt)^{−1})} − {trace(MΦ(M+Π^opt)^{−1}Φ) − log det(Φ(M+Π^opt)^{−1}Φ)} < 0.

It is easy to verify that the left side of the above inequality is equal to twice the product of the (i, j) entries of M and (M+Π^opt)^{−1}. This implies that the (i, j) entries of M and (M+Π^opt)^{−1} have opposite signs. As a result, M is sign-consistent.

###### Definition 10.

Define as a symmetric matrix whose entry is equal to for every and it is equal to zero otherwise.

The next result of this paper is a consequence of Lemmas 4 and 5 and Theorem 1.

###### Theorem 2.

Assume that the graph supp(Σ^res) is acyclic and that the matrix I + ~Σ^res is positive-definite. Then, the relation supp(S^opt) ⊆ supp(Σ^res) holds and the optimal solution of the GL can be computed via the explicit formula

 S^opt_ij =
  (1/Σ_ii) × (1 + ∑_{(i,m)∈E^opt} (Σ^res_im)² / (Σ_ii Σ_mm − (Σ^res_im)²))  if i = j,
  −Σ^res_ij / (Σ_ii Σ_jj − (Σ^res_ij)²)                                     if (i, j) ∈ E^opt,
  0                                                                          otherwise, (11)

where E^res and E^opt denote the edge sets of supp(Σ^res) and supp(S^opt), respectively.

When the regularization parameter λ is large, the graph supp(Σ^res) is expected to be sparse and possibly acyclic. In this case, the matrix Σ^res is sparse with small nonzero entries. If I + ~Σ^res is positive-definite and supp(Σ^res) is acyclic, Theorem 2 reveals two important properties of the solution of the GL: 1) its support graph is contained in the sparsity graph of the thresholded sample covariance matrix, and 2) the entries of this matrix can be found using the explicit formula (11). However, this formula requires knowing the locations of the nonzero elements of S^opt. In what follows, we will replace the assumptions of the above theorem with easily verifiable rules that are independent of the optimal solution or the locations of its nonzero entries. Furthermore, it will be shown that these conditions are expected to hold when λ is large enough, i.e., if a sparse matrix is sought.
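A minimal sketch of formula (11) in NumPy (our implementation, with our own example covariance; following the discussion above, the support of the residue is used in place of E^opt):

```python
import numpy as np

def gl_closed_form(Sigma, lam):
    """Assemble the closed-form GL solution of formula (11), assuming the
    soft-thresholded residue has an acyclic support graph."""
    d = Sigma.shape[0]
    R = np.where(np.abs(Sigma) > lam, Sigma - lam * np.sign(Sigma), 0.0)
    np.fill_diagonal(R, 0.0)                       # residue: off-diagonal only
    S = np.zeros((d, d))
    for i in range(d):
        S[i, i] = (1.0 / Sigma[i, i]) * (
            1.0 + sum(R[i, m] ** 2 / (Sigma[i, i] * Sigma[m, m] - R[i, m] ** 2)
                      for m in range(d) if R[i, m] != 0.0))
        for j in range(d):
            if i != j and R[i, j] != 0.0:
                S[i, j] = -R[i, j] / (Sigma[i, i] * Sigma[j, j] - R[i, j] ** 2)
    return S, R

# Path-structured example: supp(residue) is acyclic for lam = 0.2.
Sigma = np.array([[1.0, 0.5, 0.1],
                  [0.5, 1.0, 0.4],
                  [0.1, 0.4, 1.0]])
lam = 0.2
S, R = gl_closed_form(Sigma, lam)
K = np.linalg.inv(S)

# KKT check for (2): diag(S^{-1}) = diag(Sigma); on the support, S^{-1}_ij
# equals the residue entry (since sign(S_ij) = -sign(R_ij)); off the
# support, |S^{-1}_ij - Sigma_ij| <= lam.
assert np.allclose(np.diag(K), np.diag(Sigma))
assert np.isclose(K[0, 1], 0.3) and np.isclose(K[1, 2], 0.2)
assert abs(K[0, 2] - Sigma[0, 2]) <= lam + 1e-12
```

The embedded assertions verify the optimality conditions of (2) directly, so for this small example the closed-form matrix is the exact GL solution rather than an approximation.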

###### Theorem 3.

Assume that the following conditions are satisfied:

• Condition 2-i. The graph supp(Σ^res) is acyclic.

• Condition 2-ii. I + ~Σ^res is positive-definite.

• Condition 2-iii. ∥~Σ^res∥²_max ≤ min_{i≠j: |Σ_ij| ≤ λ} (λ − |Σ_ij|)/√(Σ_ii Σ_jj).

Then, the sparsity pattern of the optimal solution S^opt corresponds to the sparsity pattern of Σ^res and, in addition, S^opt can be obtained via the explicit formula (11).

The above theorem states that if a sparse graph is sought, then as long as some easy-to-verify conditions are met, there is an explicit formula for the optimal solution. It will later be shown that Condition (2-i) is exactly or approximately satisfied if the regularization coefficient λ is sufficiently large. Condition (2-ii) implies that the eigenvalues of the normalized residue of Σ with respect to λ should be greater than −1. This condition is expected to be automatically satisfied since most of the elements of ~Σ^res are equal to zero and the nonzero elements have small magnitudes. In particular, this condition is satisfied if I + ~Σ^res is diagonally dominant. Finally, using (9), it can be verified that Condition (2-iii) is satisfied if

 ((2σ_1 − σ_k − σ_{k+1})/Σ_max)² / ((σ_k − σ_{k+1})/Σ_max) ≤ 2/r². (12)

Similar to the arguments made in the previous subsection, (12) shows that Condition (2-iii) is expected to be satisfied when the elements of the set T are sufficiently spread, which is likely the case in practice when a sparse graph is sought. Under such circumstances, one can use Theorem 3 to obtain the solution of the GL without having to solve (2) numerically.

Having computed the sample covariance matrix, we will next show that checking the conditions in Theorem 3 and finding using (11) can all be carried out efficiently.

###### Corollary 1.

Given supp(Σ^res) and Σ, the total time complexity of checking the conditions in Theorem 3 and finding S^opt using (11) is O(d²).

Another line of work has been devoted to studying the connectivity structure of the optimal solution of the GL. In particular, [24] and [25] have shown that the connected components induced by thresholding the covariance matrix and those in the support graph of the optimal solution of the GL lead to the same vertex partitioning. Although this result does not require any particular condition, it cannot provide any information about the edge structure of the support graph, and one needs to solve (2) for each connected component using an iterative algorithm, which may take up to O(d³) operations per iteration [16, 17, 24]. Corollary 1 states that this complexity can be reduced significantly for sparse graphs.
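The vertex-partitioning step from [24, 25] is cheap to implement; the sketch below (our own code and example) computes the connected components of the thresholded support graph, after which each block could in principle be handled separately:

```python
import numpy as np

def connected_components(adjacency):
    """Label connected components of an undirected graph via DFS over a
    boolean adjacency matrix; returns a component label per vertex."""
    d = adjacency.shape[0]
    labels = -np.ones(d, dtype=int)
    comp = 0
    for s in range(d):
        if labels[s] == -1:
            stack = [s]
            while stack:
                v = stack.pop()
                if labels[v] == -1:
                    labels[v] = comp
                    stack.extend(np.nonzero(adjacency[v])[0])
            comp += 1
    return labels

Sigma = np.array([[1.0, 0.6, 0.0, 0.05],
                  [0.6, 1.0, 0.05, 0.0],
                  [0.0, 0.05, 1.0, 0.7],
                  [0.05, 0.0, 0.7, 1.0]])
lam = 0.2
A = (np.abs(Sigma) > lam) & ~np.eye(4, dtype=bool)   # thresholded support
labels = connected_components(A)
# Nodes {0, 1} and {2, 3} form separate blocks after thresholding at lam = 0.2.
```

By the results of [24, 25], the GL solution is block-diagonal with respect to this partition, so each block can be solved independently; the contribution of Corollary 1 is to go further and recover the edge structure (and entries) within each acyclic block in closed form.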

###### Remark 2.

The results introduced in Theorem 1 can indeed be categorized as a set of "safe rules" that correctly determine the sparsity pattern of the optimal solution of the GL. These rules are subsequently reduced to a set of easily verifiable conditions in Theorem 3 to safely obtain the correct sparsity pattern of the acyclic components in the optimal solution. On the other hand, there is a large body of literature on simple and cheap safe rules to pre-screen and simplify sparse learning and estimation problems, including Lasso, logistic regression, support vector machine, and group Lasso [26, 27, 28, 29]. Roughly speaking, these methods are based on constructing a sequence of safe regions that encompass the optimal solution of the dual of the problem at hand. These safe regions, together with the Karush-Kuhn-Tucker (KKT) conditions, give rise to a set of rules that facilitate inferring the sparsity pattern of the optimal solution. Our results are similar to these methods in that we also analyze the special structure of the KKT conditions and resort to the dual of the GL to obtain the correct sparsity structure of the optimal solution. However, according to the seminal work [29], most of the developed results on safe screening rules rely on strong Lipschitz assumptions on the objective function, an assumption that is violated in the GL. This calls for new machinery to derive theoretically correct rules for this problem, a goal that is at the core of Theorems 1 and 3.

### 3.3 Approximate Closed-form Solution: Sparse Graphs

In the preceding subsection, it was shown that, under some mild assumptions, the GL has an explicit closed-form solution if the support graph of the thresholded sample covariance matrix is acyclic. In this part, a similar approach will be taken to find approximate solutions of the GL with an arbitrary underlying sparsity graph. In particular, by closely examining the hard-to-check conditions of Theorem 1, a set of simple and easy-to-verify surrogates will be introduced which give rise to an approximate closed-form solution for the general sparse GL. Furthermore, we will derive a strong upper bound on the approximation error and show that it decreases exponentially fast with respect to the length of the minimum-length cycle in the support graph of the thresholded sample covariance matrix. Indeed, the formula obtained earlier for acyclic graphs could be regarded as a by-product of this generalization since the length of the minimum-length cycle can be considered as infinity for such graphs. The significance of this result is twofold:

• Recall that the support graph corresponding to the optimal solution of the GL is sparse (but not necessarily acyclic) for a large regularization coefficient. In this case, the approximation error is provably small and the derived closed-form solution can serve as a good approximation for the exact solution of the GL. This will later be demonstrated in different simulations.

• The performance and runtime of numerical (iterative) algorithms for solving the GL heavily depend on their initializations. It is known that if the initial point is chosen close enough to the optimal solution, these algorithms converge to the optimal solution in just a few iterations [16, 23, 30]. The approximate closed-form solution designed in this paper can be used as an initial point for the existing numerical algorithms to significantly improve their runtime.

The proposed approximate solution for the GL with an arbitrary support graph has the following form:

$$
A_{ij}=\begin{cases}
\dfrac{1}{\Sigma_{ii}}\left(1+\displaystyle\sum_{(i,m)\in\mathcal{E}^{\mathrm{res}}}\frac{\left(\Sigma^{\mathrm{res}}_{im}\right)^{2}}{\Sigma_{ii}\Sigma_{mm}-\left(\Sigma^{\mathrm{res}}_{im}\right)^{2}}\right) & \text{if } i=j,\\[2ex]
-\dfrac{\Sigma^{\mathrm{res}}_{ij}}{\Sigma_{ii}\Sigma_{jj}-\left(\Sigma^{\mathrm{res}}_{ij}\right)^{2}} & \text{if } (i,j)\in\mathcal{E}^{\mathrm{res}},\\[2ex]
0 & \text{otherwise.}
\end{cases}\tag{13}
$$

The definition of this matrix does not make any assumption on the structure of the graph supp(Σres). Recall that Eres in the above formula is the shorthand notation for the edge set of supp(Σres). As a result, the matrix A is a function of λ. To prove that the above matrix is an approximate solution of the GL, a few steps need to be taken. First, recall that—according to the proof of Lemma 4—it is possible to explicitly build the inverse-consistent complement of the thresholded sample covariance matrix if its sparsity graph is acyclic. This matrix serves as a certificate to confirm that the explicit solution (13) indeed satisfies the KKT conditions for the GL. By adopting a similar approach, it will then be proved that if the support graph of the thresholded sample covariance matrix is sparse, but not necessarily acyclic, one can find an approximate inverse-consistent complement of the proposed closed-form solution that approximately satisfies the KKT conditions.
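As a sanity check, formula (13) can be evaluated entrywise. The sketch below is a minimal implementation under two assumptions that are ours, not the paper's: the helper `soft_threshold_offdiag` stands in for the residue operation producing Σres (off-diagonal soft-thresholding at level λ, diagonal kept), and an edge of Eres is taken to mean a nonzero off-diagonal residue entry.

```python
import numpy as np

def soft_threshold_offdiag(Sigma, lam):
    """Hypothetical residue operator: soft-threshold the off-diagonal
    entries of Sigma at level lam, keeping the diagonal intact."""
    R = np.sign(Sigma) * np.maximum(np.abs(Sigma) - lam, 0.0)
    np.fill_diagonal(R, np.diag(Sigma))
    return R

def closed_form_A(Sigma, lam):
    """Build the approximate GL solution A of equation (13) entrywise."""
    R = soft_threshold_offdiag(Sigma, lam)
    d = Sigma.shape[0]
    diag = np.diag(Sigma)
    A = np.zeros((d, d))
    for i in range(d):
        acc = 1.0
        for m in range(d):
            # (i, m) is an edge of supp(Sigma_res) iff the residue is nonzero
            if m != i and R[i, m] != 0.0:
                acc += R[i, m] ** 2 / (diag[i] * diag[m] - R[i, m] ** 2)
        A[i, i] = acc / diag[i]
    for i in range(d):
        for j in range(d):
            if i != j and R[i, j] != 0.0:
                A[i, j] = -R[i, j] / (diag[i] * diag[j] - R[i, j] ** 2)
    return A
```

For an acyclic support the formula is exact (per the preceding subsection): for a 2-node example, inverting A recovers a matrix that agrees with Σ on the diagonal and with the thresholded entries on the edges.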

###### Definition 11.

Given a number ε > 0, a matrix B is called an ε-relaxed inverse of a matrix A if AB = I + E for some error matrix E such that |Eij| ≤ ε for every pair (i, j).
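This notion is easy to test numerically. The check below encodes one plausible reading of Definition 11 (AB = I + E with every entry of E bounded by ε); the function name is ours.

```python
import numpy as np

def is_eps_relaxed_inverse(B, A, eps):
    """Check whether B is an eps-relaxed inverse of A: A @ B must equal
    the identity up to an entrywise error of at most eps."""
    E = A @ B - np.eye(A.shape[0])
    return bool(np.all(np.abs(E) <= eps))
```

With ε = 0 this reduces to B being the exact inverse of A.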

The next lemma offers optimality (KKT) conditions for the unique solution of the GL.

###### Lemma 6 ([22]).

A matrix Sopt is the optimal solution of the GL if and only if it satisfies the following conditions for every pair (i, j):

$$
\begin{aligned}
&\left(S^{\mathrm{opt}}\right)^{-1}_{ij}=\Sigma_{ij} &&\text{if } i=j, &&&\text{(14a)}\\
&\left(S^{\mathrm{opt}}\right)^{-1}_{ij}=\Sigma_{ij}+\lambda\cdot\operatorname{sign}\!\left(S^{\mathrm{opt}}_{ij}\right) &&\text{if } S^{\mathrm{opt}}_{ij}\neq 0, &&&\text{(14b)}\\
&\Sigma_{ij}-\lambda\le\left(S^{\mathrm{opt}}\right)^{-1}_{ij}\le\Sigma_{ij}+\lambda &&\text{if } S^{\mathrm{opt}}_{ij}=0, &&&\text{(14c)}
\end{aligned}
$$

where (Sopt)−1ij denotes the (i, j) entry of the inverse of Sopt.

The following definition introduces a relaxed version of the first-order optimality conditions given in (14).

###### Definition 12.

Given a number ε ≥ 0, it is said that a matrix A satisfies the ε-relaxed KKT conditions for the GL problem if there exists a matrix B such that

• B is an ε-relaxed inverse of the matrix A.

• The pair (A, B) satisfies the conditions

$$
\begin{aligned}
&B_{ij}=\Sigma_{ij} &&\text{if } i=j, &&&\text{(15a)}\\
&\left|B_{ij}-\left(\Sigma_{ij}+\lambda\cdot\operatorname{sign}(A_{ij})\right)\right|\le\epsilon &&\text{if } A_{ij}\neq 0, &&&\text{(15b)}\\
&\left|B_{ij}-\Sigma_{ij}\right|\le\lambda+\epsilon &&\text{if } A_{ij}=0. &&&\text{(15c)}
\end{aligned}
$$

By leveraging the above definition, the objective is to prove that the explicit solution introduced in (13) satisfies the ε-relaxed KKT conditions for some number ε to be defined later.
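Definition 12 likewise admits a direct numerical test. The sketch below (an illustrative checker; names are ours) verifies both the ε-relaxed inverse requirement and conditions (15a)–(15c) for a candidate pair (A, B).

```python
import numpy as np

def satisfies_relaxed_kkt(A, B, Sigma, lam, eps):
    """Check Definition 12: B must be an eps-relaxed inverse of A, and
    the pair (A, B) must satisfy the relaxed conditions (15a)-(15c)."""
    d = A.shape[0]
    # eps-relaxed inverse: A @ B = I + E with |E_ij| <= eps
    if np.any(np.abs(A @ B - np.eye(d)) > eps):
        return False
    for i in range(d):
        for j in range(d):
            if i == j:
                if B[i, j] != Sigma[i, j]:                                 # (15a)
                    return False
            elif A[i, j] != 0:
                if abs(B[i, j] - (Sigma[i, j] + lam * np.sign(A[i, j]))) > eps:  # (15b)
                    return False
            else:
                if abs(B[i, j] - Sigma[i, j]) > lam + eps:                 # (15c)
                    return False
    return True
```

With ε = 0 this collapses back to the exact conditions of Lemma 6, with B playing the role of the inverse of the optimal solution.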

###### Definition 13.

Given a graph G, define the function c(G) as the length of the minimum-length cycle of G (the number c(G) is set to +∞ if G is acyclic). Let deg(G) refer to the maximum degree of G. Furthermore, define Pij(G) as the set of all simple paths between nodes i and j in G, and denote the maximum of |Pij(G)| over all pairs (i, j) as Pmax(G).

Define Σmax and Σmin as the maximum and minimum diagonal elements of Σ, respectively.
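The three graph quantities of Definition 13 can be computed by brute force on small graphs. The sketch below (function name and representation are ours) computes the girth c(G), the maximum degree deg(G), and Pmax(G) by exhaustive enumeration of simple paths; it is meant only to make the definitions concrete, not to scale.

```python
from itertools import combinations
from collections import deque

def graph_quantities(n, edges):
    """Return (c(G), deg(G), Pmax(G)) for an undirected graph on nodes
    0..n-1. c(G) is the minimum cycle length (inf if acyclic), deg(G)
    the maximum degree, Pmax(G) the maximum number of simple paths
    between any node pair. Brute force; small graphs only."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    max_deg = max((len(adj[v]) for v in range(n)), default=0)

    def simple_paths(s, t):
        # Count all simple s-t paths by depth-first search.
        count, stack = 0, [(s, {s})]
        while stack:
            v, seen = stack.pop()
            if v == t:
                count += 1
                continue
            for w in adj[v]:
                if w not in seen:
                    stack.append((w, seen | {w}))
        return count

    p_max = max((simple_paths(s, t) for s, t in combinations(range(n), 2)),
                default=0)

    # Girth: for each edge (u, v), the shortest cycle through it has
    # length 1 + (shortest u-v path that avoids the edge itself).
    girth = float('inf')
    for u, v in edges:
        dist, q = {u: 0}, deque([u])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if {x, y} == {u, v}:
                    continue
                if y not in dist:
                    dist[y] = dist[x] + 1
                    q.append(y)
        if v in dist:
            girth = min(girth, dist[v] + 1)
    return girth, max_deg, p_max
```

On a triangle, for example, the girth is 3 and every node pair is joined by two simple paths; on a tree the girth is +∞ and Pmax = 1, consistent with the acyclic case discussed below.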

###### Theorem 4.

The explicit solution (13) satisfies the ε-relaxed KKT conditions for the GL with ε chosen as

$$
\epsilon=\max\left\{\Sigma_{\max},\sqrt{\Sigma_{\max}\Sigma_{\min}}\right\}\cdot\delta\cdot\left(P_{\max}\!\left(\mathrm{supp}(\Sigma^{\mathrm{res}})\right)-1\right)\cdot\left(\left\|\tilde{\Sigma}^{\mathrm{res}}\right\|_{\max}\right)^{\left\lceil \frac{c\left(\mathrm{supp}(\Sigma^{\mathrm{res}})\right)}{2}\right\rceil},\tag{16}
$$

where

$$
\delta=1+\frac{\deg\!\left(\mathrm{supp}(\Sigma^{\mathrm{res}})\right)\cdot\left\|\tilde{\Sigma}^{\mathrm{res}}\right\|_{\max}^{2}}{1-\left\|\tilde{\Sigma}^{\mathrm{res}}\right\|_{\max}^{2}}+\frac{\deg\!\left(\mathrm{supp}(\Sigma^{\mathrm{res}})\right)-1}{1-\left\|\tilde{\Sigma}^{\mathrm{res}}\right\|_{\max}^{2}},\tag{17}
$$

if the following conditions are satisfied:

• Condition 3-i. is positive-definite.

• Condition 3-ii. .

The number ε given in Theorem 4 is composed of several parts:

• (‖~Σres‖max)^⌈c(supp(Σres))/2⌉: Notice that ‖~Σres‖max is strictly less than 1 and c(supp(Σres)) is large when a sparse graph is sought. Therefore, this factor is expected to be small for sparse graphs. Under this assumption, we have ε ≈ 0.

• c(supp(Σres)): It is straightforward to verify that c(supp(Σres)) is a non-decreasing function of λ. This is due to the fact that as λ increases, Σres becomes sparser and this results in a support graph with fewer edges. In particular, c(supp(Σres)) becomes +∞ once λ is large enough that supp(Σres) is acyclic.

• Pmax(supp(Σres)) and deg(supp(Σres)): These two parameters are non-increasing functions of λ and likely to be small for large λ. For a small λ, the numbers Pmax(supp(Σres)) and deg(supp(Σres)) could be large—exponential and linear in the dimension, respectively. However, these values are expected to be small for sparse graphs. In particular, it is easy to verify that for nonempty and acyclic graphs, Pmax(supp(Σres)) = 1, in which case the bound (16) vanishes.
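To make the roles of these quantities concrete, the following sketch evaluates the error bound numerically under our parsing of (16)–(17); the function name and argument list are ours, and the parsed form of δ in particular should be treated as an assumption rather than the definitive formula.

```python
import math

def epsilon_bound(sigma_max, sigma_min, norm_res_max, girth, max_deg, p_max):
    """Evaluate the error bound of (16)-(17), as parsed here.
    girth, max_deg, p_max refer to supp(Sigma_res); norm_res_max is
    ||~Sigma_res||_max, assumed strictly less than 1."""
    if math.isinf(girth):
        # Acyclic support: p_max = 1 and the bound vanishes, matching
        # the exact closed-form solution of the preceding subsection.
        return 0.0
    delta = (1.0
             + max_deg * norm_res_max**2 / (1.0 - norm_res_max**2)
             + (max_deg - 1) / (1.0 - norm_res_max**2))
    lead = max(sigma_max, math.sqrt(sigma_max * sigma_min))
    return lead * delta * (p_max - 1) * norm_res_max**math.ceil(girth / 2)
```

Because norm_res_max < 1, the bound decays exponentially in the girth: lengthening the minimum cycle from 4 to 6 shrinks ε by a factor of norm_res_max.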

The above observations imply that if λ is large enough and the support graph of Σres is sparse, (13) serves as a good approximation of the optimal solution of the GL. In other words, it results from (16) that if supp(Σres) has a structure that is close to an acyclic graph, i.e., it has only a few cycles with moderate lengths, we have ε ≈ 0. In Section 4, we will present illustrative examples to show the accuracy of the closed-form approximate solution with respect to the size of the cycles in the sparsity graph.

Consider the matrix A given in (13), and let μmin(A) and μmax(A) denote its minimum and maximum eigenvalues, respectively. If λ is large enough that every off-diagonal entry of Σ is thresholded to zero, then A = Σd−1 (recall that Σd collects the diagonal entries of Σ) and subsequently A is positive-definite. Since A is a continuous function of λ, there exists a number λ0 such that the matrix A (implicitly defined based on λ) is positive-definite for every λ ≥ λ0. The following theorem further elaborates on the connection between the closed-form formula and the optimal solution of the GL.

###### Theorem 5.

There exists a strictly positive number λ0 such that, for every λ ≥ λ0, the matrix A given in (13) is the optimal solution of the GL problem after replacing Σ with some perturbed matrix Σ̂ that satisfies the inequality

$$
\left\|\Sigma-\hat{\Sigma}\right\|_{2}\le d_{\max}(A)\left(\frac{1}{\mu_{\min}(A)}+1\right)\epsilon,\tag{18}
$$

where dmax(A) is the maximum vertex cardinality of the connected components in the graph supp(A) and ε is given in (16). Furthermore, (18) implies that

$$
f(A)-f^{*}\le\left(\mu_{\max}(A)+\mu_{\max}\!\left(S^{\mathrm{opt}}\right)\right)d_{\max}(A)\left(\frac{1}{\mu_{\min}(A)}+1\right)\epsilon,\tag{19}
$$

where f(A) and f* are the values of the objective function of the GL evaluated at A and at the optimal solution, respectively.

As mentioned before, if a sparse solution is sought for the GL, the regularization coefficient λ would be large, and this helps with the satisfaction of the inequality λ ≥ λ0. In fact, it will be shown through different simulations that λ0 is small in practice and, hence, this condition is not restrictive. Under this circumstance, Theorem 5 states that the easy-to-construct matrix A is 1) the exact optimal solution of the GL problem with a perturbed sample covariance matrix, and 2) an approximate solution of the GL with the original sample covariance matrix. The magnitudes of this perturbation and approximation error are functions of ε, dmax(A), μmin(A), μmax(A), and μmax(Sopt). Furthermore, it should be clear that A and ε are functions of λ and Σ (we dropped this dependency for simplicity of notation). Recall that the disjoint components (or the vertex partitions) of supp(Σres) satisfy a nested property: given λ1 ≤ λ2, the components of supp(Σres) for λ2 are nested within the components of supp(Σres) for λ1 (see [24] for a simple proof of this statement). This implies that