In recent years, there has been a great deal of interest in developing computationally efficient methods to analyze large-scale and high-dimensional data. The data collected in practice is often overwhelmingly large; therefore, designing simple yet informative models for describing the underlying structure of the data is of significant importance, and sparsity-promoting techniques have become an essential part of inference and learning methods. These techniques have been widely used in data mining [1, 2], functional connectivity of the human brain [3], distributed controller design [4, 5], transportation systems [6], and compressive sensing [7]. On the other hand, in many applications, the number of available data samples is much smaller than the dimension of the data. This implies that most statistical learning techniques, which are proven to be consistent with the true structure of the data, fail dramatically in practice. This is due to the fact that most of the convergence results for inference methods are contingent upon the availability of a sufficient number of samples, which may not be the case in practice. In an effort to overcome this issue, sparsity-inducing penalty functions are often used to arrive at a parsimonious graphical model for the available data. Graphical Lasso (GL) [8, 9] is one of the most widely used methods for sparse estimation of inverse covariance matrices via the augmentation of a Lasso-type penalty function. It is known that the GL can be computationally prohibitive for large-scale problems, which limits its applicability in practice. Recently, it has been shown in various applications, such as brain connectivity networks, electrical circuits, and transportation networks, that the thresholding technique and the GL lead to the same sparsity structure [10, 6]. Moreover, [10] shows that under some conditions, a simple thresholding of the sample covariance matrix will result in the same sparsity pattern as the optimal solution of the GL.
These conditions have been modified in [6] to depend only on the sample covariance matrix (and not the optimal solution of the GL). Based on this equivalence, [6] introduces a closed-form solution for the GL, the exactness of which depends on the sparsity structure of the thresholded sample covariance matrix. In another line of work, [11] and [12] consider the disjoint components of the thresholded sample covariance matrix and show that the GL can be solved independently for each of these components. Although this result does not require additional conditions on the structure, its applicability is limited since it does not reveal any information about the connectivity of the sparsity graph corresponding to the optimal solution of the GL.
I-A Problem Formulation
Consider a random vector $X = (x_1, x_2, \dots, x_d)$ with an underlying multivariate Gaussian distribution. Let $\Sigma_*$ denote the covariance matrix of this random vector. Without loss of generality, we assume that $X$ has zero mean. The goal is to estimate the entries of $\Sigma_*^{-1}$ based on independent samples of $X$. The sparsity pattern of $\Sigma_*^{-1}$ determines which random variables in $X$ are conditionally independent. In particular, if the $(i,j)$ entry of $\Sigma_*^{-1}$ is zero, it means that $x_i$ and $x_j$ are conditionally independent, given the remaining entries of $X$ (the value of this entry is proportional to the partial correlation between $x_i$ and $x_j$). In this paper, we assume that $\Sigma_*^{-1}$ is sparse and nonsingular. The problem of studying the conditional independence of different entries of $X$ is hard in practice due to the fact that the true covariance matrix $\Sigma_*$ is rarely known a priori. Therefore, the sample covariance matrix must instead be used to estimate the true covariance matrix. Let $\Sigma$ denote the sample covariance matrix. To estimate $\Sigma_*^{-1}$, consider the optimization problem

$$\min_{S \succ 0}\ -\log\det(S) + \mathrm{trace}(\Sigma S). \tag{1}$$
The optimal solution of the above problem is equal to $\Sigma^{-1}$. However, the number of available samples in many applications is smaller than the dimension of $X$. This makes $\Sigma$ ill-conditioned or even singular, which would lead to large or unbounded entries in the optimal solution of (1). Furthermore, although $\Sigma_*^{-1}$ is assumed to be sparse, a small difference between $\Sigma$ and $\Sigma_*$ would potentially make $\Sigma^{-1}$ highly dense. In an effort to address the aforementioned issues, consider the $\ell_1$-regularized version of (1):

$$\min_{S \succ 0}\ -\log\det(S) + \mathrm{trace}(\Sigma S) + \lambda \|S\|_{1,\mathrm{off}}, \tag{2}$$

where $\lambda > 0$ is a regularization coefficient. Let the optimal solution of (2) be denoted by $S^{\mathrm{opt}}$. This problem is known as Graphical Lasso (GL). The term $\|S\|_{1,\mathrm{off}}$ in the objective function is defined as the summation of the absolute values of the off-diagonal entries of $S$. This additional penalty acts as a surrogate for promoting sparsity in the off-diagonal elements of $S$, while ensuring that the problem is well-defined even with a singular input $\Sigma$.
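To make the objective in (2) concrete, here is a minimal numpy sketch (the helper name `gl_objective` and the 2x2 toy matrix are illustrative, not from the paper) that evaluates the GL objective and checks that, with the penalty removed, it is minimized at the inverse of the input matrix.

```python
import numpy as np

def gl_objective(S, Sigma, lam):
    """GL objective: -log det(S) + trace(Sigma @ S) plus an l1 penalty
    on the off-diagonal entries of S."""
    sign, logdet = np.linalg.slogdet(S)
    assert sign > 0, "S must be positive definite"
    off_l1 = np.abs(S).sum() - np.abs(np.diag(S)).sum()
    return -logdet + np.trace(Sigma @ S) + lam * off_l1

# Sanity check: with lam = 0 the objective is minimized at S = Sigma^{-1}.
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
S_star = np.linalg.inv(Sigma)
f_star = gl_objective(S_star, Sigma, 0.0)
f_other = gl_objective(np.eye(2), Sigma, 0.0)
```

At the unpenalized minimizer the objective evaluates to $\log\det(\Sigma) + d$, which the assertions below confirm for this toy matrix.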
It is well known that the GL is computationally prohibitive for large-scale problems. One way to circumvent solving the highly complex GL is to simply threshold the sample covariance matrix in order to obtain a candidate structure for the sparsity pattern of the optimal solution to the GL. It has been shown in several real-world problems, including brain connectivity networks and topology identification of electrical circuits [3, 10], that the thresholding method can correctly identify the nonzero pattern of $S^{\mathrm{opt}}$. Recently, we have shown that the sparsity structure of the thresholding method coincides with that of the GL under some conditions on the sample covariance matrix [6]. Although these conditions are not easy to verify, it is shown that they are generically satisfied when a sparse solution for the GL is sought, or equivalently, when the regularization parameter $\lambda$ in (2) is large. Based on this observation, [6] derives a closed-form solution for the GL problem that is globally or near-globally optimal, depending on the structure of the sample covariance matrix.
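The thresholding step itself is a one-liner; the following hypothetical numpy sketch extracts the candidate sparsity pattern by keeping the diagonal and every off-diagonal entry whose magnitude exceeds the threshold.

```python
import numpy as np

def threshold_support(Sigma, lam):
    """Candidate sparsity pattern from simple thresholding: keep the
    diagonal and every off-diagonal entry with |Sigma_ij| > lam."""
    support = (np.abs(Sigma) > lam).astype(int)
    np.fill_diagonal(support, 1)
    return support

Sigma = np.array([[1.0,  0.4, 0.05],
                  [0.4,  1.0, 0.3 ],
                  [0.05, 0.3, 1.0 ]])
pattern = threshold_support(Sigma, 0.2)
# keeps the (1,2) and (2,3) entries, removes the weak (1,3) entry
```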
In this paper, we generalize the results of [6] to the cases where the thresholded sample covariance matrix has a chordal structure. A matrix has a chordal structure if every cycle of length at least 4 in its support graph has a chord. Clearly, this class of sparsity structures includes acyclic graphs. First, we revisit the conditions introduced in [6] for the equivalence of the sparsity structures found using the GL and the simple method of thresholding. We show that the conditions for this equivalence can be significantly simplified when the support graph of the thresholded covariance matrix is chordal. Furthermore, we show that under some mild assumptions, these conditions are automatically satisfied as the size of the sample covariance matrix grows, provided that it possesses a sparse and chordal structure. In the second part of the paper, we generalize the closed-form solution of the GL for acyclic thresholded sample covariance matrices to those with chordal structures. More specifically, we show that $S^{\mathrm{opt}}$ can be obtained using a closed-form recursive formula when the thresholded sample covariance matrix has a chordal structure. As pointed out in [23], most of the numerical algorithms for solving the GL have a worst-case complexity of at least $O(d^3)$. We show that the recursive formula requires a number of iterations growing linearly in the size of the problem, and that the complexity of each iteration depends on the size of the maximum clique in the sparsity graph. Therefore, given the thresholded sample covariance matrix (which can be obtained while constructing the sample covariance matrix), the complexity of solving the GL for sparse and chordal structures reduces from $O(d^3)$ to linear in $d$. In fact, we show the graceful scalability of the proposed recursive method in large-scale problems. Specifically, we show that, on average, the proposed method outperforms the best known algorithm for solving the GL by at least a factor of 16 across the considered range of sample correlation matrix sizes.
The Graphical Lasso technique is commonly used for estimating the inverse covariance matrices of Gaussian distributions. However, a similar learning method can be employed for data samples with more general underlying distributions. More precisely, the GL corresponds to the minimization of the $\ell_1$-regularized log-determinant Bregman divergence, which is a widely used metric for measuring the distance between the true and estimated parameters of a problem [13]. Therefore, the theoretical results developed in this paper are applicable to more general inference problems.
Notations: $\mathbb{R}^d$, $\mathbb{S}^d$, $\mathbb{S}^d_+$, and $\mathbb{S}^d_{++}$ are used to denote the sets of real vectors, symmetric matrices, positive-semidefinite matrices, and positive-definite matrices, respectively. The symbols $\mathrm{trace}(A)$ and $\log\det(A)$ refer to the trace and the logarithm of the determinant of the matrix $A$, respectively. The $i$th and $(i,j)$ entries of the vector $v$ and matrix $A$ are denoted by $v_i$ and $A_{ij}$, respectively. $I$ refers to an identity matrix of appropriate dimension. The sign of a scalar $x$ is shown by $\mathrm{sign}(x)$. For a set $\mathcal{D}$, $|\mathcal{D}|$ refers to its cardinality. The inequality $A \succeq 0$ ($A \succ 0$) means that $A$ is positive-semidefinite (positive-definite). For a graph $\mathcal{G}$, $\mathcal{N}_{\mathcal{G}}(i)$ denotes the set of neighbors of node $i$. Given a vector $v$ and a matrix $A$, their restrictions to index sets are defined as follows.
An index set $\alpha$ is a sorted subset of the integers $\{1, 2, \dots, d\}$. The number of elements in the index set is denoted by $|\alpha|$. Given index sets $\alpha$ and $\beta$, we define $v_\alpha$ as the $|\alpha|$-dimensional subvector of $v$ indexed by $\alpha$, and $A_{\alpha\beta}$ as the $|\alpha| \times |\beta|$ submatrix of $A$ obtained by selecting the rows indexed by $\alpha$ and the columns indexed by $\beta$.
A sparsity pattern $E$ is a symmetric binary matrix. A matrix $A$ (not necessarily symmetric) is said to have sparsity pattern $E$ if $A_{ij} = 0$ whenever $E_{ij} = 0$; the set $\mathbb{R}^{d \times d}_E$ (resp. $\mathbb{S}^d_E$) refers to the matrices (resp. symmetric matrices) with sparsity pattern $E$. The Euclidean projection onto sparsity pattern $E$ is denoted as $\Pi_E$: the $(i,j)$ element of $\Pi_E(A)$ is zero if $E_{ij} = 0$, and $A_{ij}$ if $E_{ij} \neq 0$.
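The projection $\Pi_E$ can be sketched in numpy as follows (function and variable names are illustrative):

```python
import numpy as np

def project_onto_pattern(X, E):
    """Euclidean projection onto the sparsity pattern E:
    zero out every entry of X where E is zero, keep the rest."""
    return np.where(E != 0, X, 0.0)

E = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
X = np.arange(9.0).reshape(3, 3)
P = project_onto_pattern(X, E)
```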
II Preliminaries
In this section, we review the properties of sparse and chordal matrices and their connection to the max-det matrix completion problem.
II-A Sparse Cholesky factorization
Consider solving a symmetric positive definite linear system

$$A x = b$$

by Gaussian elimination. The standard procedure comprises a factorization step, where $A$ is decomposed into the (unique) Cholesky factor matrices $A = LDL^T$, in which $D$ is diagonal and $L$ is lower-triangular with a unit diagonal, and a substitution step, where the two triangular systems of linear equations $LD z = b$ and $L^T x = z$ are solved to yield $x$.
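The factorization and substitution steps can be sketched as follows, assuming numpy is available; the $LDL^T$ factors are derived here from numpy's standard Cholesky routine rather than computed by a dedicated algorithm, and `np.linalg.solve` stands in for the triangular substitutions of a real implementation.

```python
import numpy as np

def ldl_factor(A):
    """Unit-lower-triangular L and diagonal D with A = L D L^T,
    derived from the standard Cholesky factor A = R R^T."""
    R = np.linalg.cholesky(A)      # lower triangular, positive diagonal
    d = np.diag(R)
    L = R / d                      # rescale columns -> unit diagonal
    D = np.diag(d ** 2)
    return L, D

def ldl_solve(L, D, b):
    """Solve A x = b via L z = b, D y = z, and L^T x = y."""
    z = np.linalg.solve(L, b)      # forward-substitution step
    y = z / np.diag(D)
    return np.linalg.solve(L.T, y) # back-substitution step

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
L, D = ldl_factor(A)
x = ldl_solve(L, D, b)
```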
In the case where $A$ is sparse, the Cholesky factor $L$ is often also sparse. It is common to store the sparsity pattern of $L$ in the compressed column storage format: a set of index sets $\mathcal{J}_1, \dots, \mathcal{J}_n$, in which

$$\mathcal{J}_j = \{\, i : i > j,\ L_{ij} \neq 0 \,\}$$

encodes the locations of off-diagonal nonzeros in the $j$th column of $L$. (The diagonal elements are not included because the matrix has a unit diagonal by definition.)
After storage has been allocated and the sparsity structure determined, the numerical values of $L$ and $D$ are computed using a sparse Cholesky factorization algorithm. This requires the use of the associated elimination tree [14], which is a rooted tree (or forest) on $n$ vertices, with edges defined to connect each vertex $j$ to its parent at the vertex $p(j)$ (except root nodes, which have “0” as their parent), as in

$$p(j) = \min \mathcal{J}_j,$$

in which $\min \mathcal{J}_j$ indicates the (numerically) smallest index in the index set $\mathcal{J}_j$. The elimination tree encodes the dependency information between different columns of $L$, thereby allowing information to be passed without explicitly forming the matrix.
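The parent rule above can be sketched numerically (a dense, illustrative implementation; production codes work symbolically on the pattern alone). Vertex labels are 1-based, with 0 marking a root, matching the convention in the text.

```python
import numpy as np

def elimination_tree(A, tol=1e-12):
    """Parent array of the elimination tree: p[j] is the row index of the
    first off-diagonal nonzero in column j of the Cholesky factor of A."""
    L = np.linalg.cholesky(A)
    n = A.shape[0]
    parent = [0] * n
    for j in range(n):
        below = [i for i in range(j + 1, n) if abs(L[i, j]) > tol]
        parent[j] = below[0] + 1 if below else 0   # 1-based; 0 = root
    return parent

# Tridiagonal example: the elimination tree is the path 1 -> 2 -> ... -> n.
A = np.diag([2.0] * 5) + np.diag([-1.0] * 4, 1) + np.diag([-1.0] * 4, -1)
parent = elimination_tree(A)
```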
II-B Chordal sparsity patterns
The support or sparsity graph of $E$, denoted by $\mathcal{G}(E) = (\mathcal{V}, \mathcal{E})$, is defined as a graph with the vertex set $\mathcal{V} = \{1, \dots, n\}$ and the edge set $\mathcal{E}$, where $(i,j) \in \mathcal{E}$ if and only if $E_{ij} \neq 0$ and $i \neq j$. The pattern $E$ is said to be chordal if its graph does not contain an induced cycle with length greater than three. If $E$ is not chordal, then we may add nonzeros to it until it becomes chordal; the resulting pattern is called a chordal completion (or chordal embedding or triangulation) of $E$. A chordal completion with a small number of additional nonzeros is referred to as a sparse chordal completion of $E$.
A sparsity pattern $E$ is said to factor without fill if every $A \in \mathbb{S}^n_{++}$ with sparsity pattern $E$ can be factored into $A = LDL^T$ such that $L + L^T$ also has sparsity pattern $E$. If a sparsity pattern factors without fill, then it is chordal. Conversely, if a sparsity pattern is chordal, then there exists a permutation matrix $P$ such that $PEP^T$ factors without fill [15]. The associated ordering is called a perfect elimination ordering of the chordal sparsity pattern $E$.
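Chordality and perfect elimination orderings can be tested together with maximum cardinality search (MCS); the following pure-Python sketch (illustrative, not the paper's code) returns True exactly when the input adjacency structure is chordal, since an MCS ordering is a perfect elimination ordering if and only if the graph is chordal.

```python
def is_chordal(adj):
    """Chordality test: run MCS to get a candidate perfect elimination
    ordering, then verify that the later neighbors of every vertex form
    a clique (true iff the graph is chordal)."""
    n = len(adj)
    weight = [0] * n
    numbered = [False] * n
    picked = []                                  # MCS visit order
    for _ in range(n):
        v = max((u for u in range(n) if not numbered[u]),
                key=lambda u: weight[u])
        numbered[v] = True
        picked.append(v)
        for u in adj[v]:
            if not numbered[u]:
                weight[u] += 1
    peo = picked[::-1]                           # candidate elimination order
    pos = {v: i for i, v in enumerate(peo)}
    for v in peo:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        if any(b not in adj[a] for a in later for b in later if a != b):
            return False
    return True

# The 4-cycle is not chordal; adding the chord (0, 2) makes it chordal.
c4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
c4_chord = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
```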
II-C Recursive solution of the max-det problem.
An important application of chordal sparsity is the efficient solution of the maximum determinant matrix completion problem, written as

$$\max_{X \succ 0}\ \log\det(X) \quad \text{subject to}\quad \Pi_E(X) = \Pi_E(C), \tag{8}$$

for a given large-and-sparse matrix $C$ with sparsity pattern $E$. The optimal solution of the above optimization (when it exists) is called the max-det matrix completion of $C$, and is unique. The Lagrangian dual of this problem is the following:

$$\min_{S \in \mathbb{S}^n_E,\ S \succ 0}\ -\log\det(S) + \mathrm{trace}(CS) - n, \tag{9}$$
with first-order optimality condition

$$\Pi_E(S^{-1}) = \Pi_E(C). \tag{10}$$

Strong duality gives a straightforward relation back to the primal:

$$X = S^{-1}. \tag{11}$$
Note that while $X$ is (in general) a dense matrix, $S$ is always sparse. Instead of attempting to solve the primal problem (8) for a dense matrix, we may opt to solve the dual problem (9) for a sparse matrix satisfying the optimality condition (10). In the case that the sparsity pattern $E$ factors without fill, [16] showed that (10) is actually a linear system of equations over the Cholesky factors $L$ and $D$ of the solution $S = LDL^T$; their numerical values can be explicitly computed using a recursive formula.
([16], Algorithm 4.2)
Input. Matrix $C$ that has a positive definite completion.
Output. The Cholesky factors $L$ and $D$ of $S = LDL^T$ that satisfy $\Pi_E(S^{-1}) = \Pi_E(C)$.
Algorithm. Iterate over $j$ in reverse, i.e. starting from $j = n$ and ending in $j = 1$. For each $j$, compute $D_{jj}$ and the $j$th column of $L$ from
and compute the update matrices
for each $i$ satisfying $p(i) = j$, i.e. each child of $j$ in the elimination tree.
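For the smallest nontrivial chordal pattern, a 3-node path, the recursion collapses to a single closed-form fill: the max-det completion sets the free entry so that the inverse of the completed matrix vanishes there, and any feasible perturbation of that entry strictly decreases the determinant. A hedged numpy illustration (toy numbers, not from the paper):

```python
import numpy as np

# Partially specified matrix: entries on the path pattern 1-2-3 are fixed,
# the (1,3) entry is free.
C = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.4],
              [0.0, 0.4, 1.0]])

# Closed-form max-det completion for this pattern: X13 = X12 * X23 / X22,
# which makes the (1,3) entry of the inverse equal to zero.
X = C.copy()
X[0, 2] = X[2, 0] = C[0, 1] * C[1, 2] / C[1, 1]

S = np.linalg.inv(X)     # the dual variable: sparse, tridiagonal pattern

# Perturbing the free entry decreases the determinant.
det_opt = np.linalg.det(X)
X_pert = X.copy()
X_pert[0, 2] = X_pert[2, 0] = X[0, 2] + 0.05
det_pert = np.linalg.det(X_pert)
```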
Of course, if the sparsity pattern $E$ is chordal, then we may find a perfect elimination ordering in linear time [17], and apply the above algorithm to the permuted matrix $PCP^T$, whose sparsity pattern does indeed factor without fill.
The algorithm takes $n$ steps, and the $j$th step requires a size-$|\mathcal{J}_j|$ linear solve and vector-vector product. The treewidth $\tau$ of the sparsity graph is defined as $\tau = \max_j |\mathcal{J}_j|$ and has the interpretation of the size of the largest clique in $\mathcal{G}(E)$ minus one. Combined, the algorithm has time complexity $O(n(1+\tau)^3)$. This means that the matrix completion algorithm is linear-time if the treewidth of $\mathcal{G}(E)$ is $O(1)$.
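Since, for a pattern that factors without fill, the column supports of the Cholesky factor are exactly the cliques of the graph, the treewidth can be read off a factorization. A small illustrative numpy sketch (helper name assumed, dense computation for clarity):

```python
import numpy as np

def treewidth_from_cholesky(A, tol=1e-12):
    """For a pattern that factors without fill, the largest clique is the
    largest column support {j} U J_j of the Cholesky factor, so the
    treewidth equals the maximum column count of L minus one."""
    L = np.linalg.cholesky(A)
    col_counts = (np.abs(L) > tol).sum(axis=0)
    return int(col_counts.max()) - 1

# A tridiagonal (chain) pattern has treewidth 1.
A = np.diag([2.0] * 6) + np.diag([-1.0] * 5, 1) + np.diag([-1.0] * 5, -1)
tw = treewidth_from_cholesky(A)
```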
III Main Results
To streamline the presentation, with no loss of generality we assume that the matrix $\Sigma$ used in (2) is the sample correlation matrix. This means that the diagonal elements of $\Sigma$ are normalized to 1 and the off-diagonal elements are between $-1$ and $1$. The results of this paper can readily be generalized to an arbitrary sample covariance matrix after appropriate rescaling. The following definitions are borrowed from [6].
A matrix $M \succ 0$ is called inverse-consistent if there exists a matrix $N$ with zero diagonal elements such that

$$M + N \succ 0,$$
$$N_{ij} = 0 \quad \text{whenever } M_{ij} \neq 0,$$
$$\big((M+N)^{-1}\big)_{ij} = 0 \quad \text{whenever } M_{ij} = 0 \text{ and } i \neq j,$$

i.e., the support graph of $N$ is contained in the complement of the support graph of $M$, and the support graph of $(M+N)^{-1}$ is contained in that of $M$. The matrix $N$ is called an inverse-consistent complement of $M$ and is denoted as $M^{(c)}$.
Moreover, $M$ is called sign-consistent if the $(i,j)$ entries of $M$ and $(M+N)^{-1}$ are nonzero and have opposite signs for every $(i,j)$ in the support of $M$.
Consider the matrix $M$:
We show that $M$ is both inverse- and sign-consistent. Consider the matrix $N$ defined as
Then, $(M+N)^{-1}$ can be written as
The sparsity graph of $N$ is the complement of that of $M$.
The sparsity graphs of $(M+N)^{-1}$ and $M$ are equivalent.
The nonzero off-diagonal entries of $M$ and $(M+N)^{-1}$ have opposite signs.
Therefore, it can be inferred that $M$ is both inverse- and sign-consistent, and $N$ is its inverse-consistent complement.
In [10], it has been shown that every positive definite matrix has a unique inverse-consistent complement.
Given a graph $\mathcal{G}$ and a scalar $\alpha$, define $\beta(\mathcal{G}, \alpha)$ as the maximum of $\|M^{(c)}\|_{\max}$ over all inverse-consistent positive-definite matrices $M$ with diagonal entries equal to 1 such that the support graph of $M$ is $\mathcal{G}$ and $\|M\|_{\max} \leq \alpha$, where $\|M\|_{\max}$ denotes the maximum absolute value of the off-diagonal entries of $M$.
Without loss of generality and due to the non-singularity of $\Sigma$, one can assume that all elements of $\Sigma$ are nonzero. Let $\sigma_1, \sigma_2, \dots$ denote the absolute values of the upper-triangular entries of $\Sigma$, sorted such that

$$\sigma_1 > \sigma_2 > \cdots.$$

Define the residue of $\Sigma$ relative to $\lambda$ as the matrix $\Sigma^{\mathrm{res}}$ whose $(i,j)$ entry, for $i \neq j$, is equal to $\Sigma_{ij} - \lambda \cdot \mathrm{sign}(\Sigma_{ij})$ if $|\Sigma_{ij}| > \lambda$, and equal to 0 otherwise.
Notice that $I + \Sigma^{\mathrm{res}}$ is the soft-thresholded sample correlation matrix with threshold $\lambda$. For simplicity of notation, we omit the arguments of $\Sigma^{\mathrm{res}}$ whenever they are implied from the context. The following theorem is borrowed from [6].
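The residue and the soft-thresholded matrix can be computed in a few lines of numpy (illustrative helper name and toy matrix):

```python
import numpy as np

def residue(Sigma, lam):
    """Residue of the sample correlation matrix at level lam:
    soft-threshold the off-diagonal entries and zero out the diagonal."""
    R = np.sign(Sigma) * np.maximum(np.abs(Sigma) - lam, 0.0)
    np.fill_diagonal(R, 0.0)
    return R

Sigma = np.array([[1.0,  0.6,  0.1],
                  [0.6,  1.0, -0.5],
                  [0.1, -0.5,  1.0]])
lam = 0.3
R = residue(Sigma, lam)
T = np.eye(3) + R    # soft-thresholded sample correlation matrix
```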
The thresholding method and the GL have the same sparsity patterns if the following conditions are satisfied:
Condition 1-i: $I + \Sigma^{\mathrm{res}}$ is positive definite.
Condition 1-ii: $I + \Sigma^{\mathrm{res}}$ is sign-consistent.
Condition 1-iii: The relation
In [6], it is pointed out that the aforementioned conditions in Theorem 1 are expected to hold if a sparse solution for the GL problem is sought. However, efficient verification of these conditions has yet to be addressed in practice. It has been observed that the last condition plays the most important role in verifying the optimality of the sparsity pattern of the thresholded sample correlation matrix.
III-A Upper Bound for $\beta$ in Chordal Graphs
In what follows, we derive an upper bound on $\beta(\mathcal{G}, \alpha)$ for chordal graphs and show that, under some mild assumptions, Condition 1-iii is satisfied as the size of the sample correlation matrix grows.
Suppose that the following conditions hold:
Condition 2-i: $\mathcal{G}$ is chordal.
Condition 2-ii: .
Then, we have
The proof is provided in the Appendix.
Conditions 2-ii and 2-iii are guaranteed to be satisfied for small values of $\alpha$. In such circumstances, a chordal structure for $\mathcal{G}$ is enough to verify the validity of (18). Based on this theorem, the next corollary shows that Condition 1-iii in Theorem 1 is guaranteed to be satisfied when the support graph of the sample correlation matrix is sparse and large enough.
Suppose that the following conditions hold for some and :
Then, there exists a $d_0$ such that for every $d \geq d_0$, Condition 1-iii in Theorem 1 is satisfied.
The proof is provided in the Appendix.
Corollary 1 implies that, if the sample correlation matrix is large enough and the rate of decrease in , as a function of the dimension of the data, is not much smaller than that of , then, Condition 1-iii is automatically satisfied. For instance, suppose that in (19a). Then, Corollary 1 implies that for large enough values of , Condition 1-iii is satisfied if and there exists a such that .
Although (18) only holds for chordal graphs, one can generalize Theorem 2 to non-chordal graphs under some additional assumptions. In particular, suppose that $\mathcal{G}'$ is a chordal completion of a non-chordal graph $\mathcal{G}$. Then, one can easily verify that (18) holds for $\mathcal{G}$ if $\beta(\mathcal{G}, \alpha) \leq \beta(\mathcal{G}', \alpha)$. Indeed, it can be shown that this monotonic behavior of $\beta$ is maintained under fairly general conditions. Due to space restrictions, this generalization is not included in this paper.
III-B Max-Det Matrix Completion for Graphical Lasso
In this subsection, we show that if the equivalence between the thresholding method and the GL holds, the optimal solution of the GL can be obtained using Algorithm 1.
Assume that the thresholded sample correlation matrix $I + \Sigma^{\mathrm{res}}$ is positive definite and has the same sparsity pattern as the optimal solution $S^{\mathrm{opt}}$ of the GL. Then,
$(S^{\mathrm{opt}})^{-1}$ is the unique max-det completion of $I + \Sigma^{\mathrm{res}}$.
Algorithm 1 can be used to find $S^{\mathrm{opt}}$ if the support graph of $I + \Sigma^{\mathrm{res}}$ is chordal.
The proof is provided in the Appendix.
Recall that the main goal of the GL is to promote sparsity in the inverse correlation matrix. In order to obtain a sparse solution, the regularization coefficient $\lambda$ should be large relative to the absolute values of the off-diagonal elements of the sample correlation matrix. Under such circumstances, the conditions delineated in Theorem 1 are satisfied, and the sparsity pattern of the simple thresholding technique corresponds to that of the GL. Theorem 3 uses this result to show that, for large values of $\lambda$, instead of merely identifying the sparsity structure, the thresholded sample correlation matrix can be further exploited to find the optimal solution of the GL problem by solving its corresponding max-det matrix completion problem. Note that the first part of Theorem 3 is independent of the structure of the thresholded sample correlation matrix. However, the second part of the theorem shows that solving the corresponding max-det matrix completion problem can be much easier than the GL problem and can be performed in linear time using Algorithm 1 when the thresholded sample correlation matrix has a sparse and chordal structure.
While the focus of this paper is on thresholded sample correlation matrices with chordal structures, the presented method may be extended to matrices with non-chordal sparsity patterns. Note that for non-chordal structures, the provided recursive formula does not necessarily result in the optimal solution. However, it has been shown in [18] that efficient implementations of Newton and conjugate gradient methods for the max-det matrix completion problem are possible when the sparsity structure of the problem has a sparse chordal completion. The detailed analysis of this extension is left as future work. Furthermore, as pointed out in [11] and [12], the disjoint components of the sparsity graph induced by thresholding the sample correlation matrix can be treated independently, since the GL decomposes into multiple smaller problems over these disjoint components. Therefore, the proposed method can be applied to every chordal component even if the overall sparsity graph of the thresholded sample correlation matrix does not have a chordal structure.
IV Numerical Results
Table I: For each test case, the matrix size and maximum clique size, together with the running time of the proposed method, and the running time, optimality gap, and speedup for each of the two competing solvers.
Using the method proposed in this paper, we solve the GL problem on various large-scale problems whose thresholded sample correlation matrices have chordal structures. All test cases are collected from the SuiteSparse Matrix Collection [19] and the MATPOWER package [20, 21]. These are publicly available and widely used datasets of large-and-sparse matrices from real-world problems. The simulations are run on a laptop computer with an Intel Core i7 quad-core 2.50 GHz CPU and 16GB RAM. The results reported in this section are for a serial implementation in MATLAB.
IV-A Data Generation
For each test case, we take the following steps to design the sample correlation matrix.
First, the nonzero structure of the matrix for a given test case is extracted and a symbolic Cholesky factorization is performed to arrive at a chordal embedding of the given structure. In other words, we augment the sparsity graph corresponding to the considered test case with additional edges to obtain a sparse chordal completion of the graph.
The elements of the sample correlation matrix corresponding to the edges in the extended graph are chosen randomly from the union of the intervals and . The rest of the elements are randomly chosen from the interval .
The diagonal elements of the sample correlation matrix are elevated in accordance with the off-diagonal elements in order to make the sample correlation matrix positive semidefinite. The resulting matrix is normalized, if necessary.
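The three steps above can be sketched as follows; the interval endpoints and the diagonal-dominance shift are illustrative choices, not the exact values used in the experiments, and the helper name is hypothetical.

```python
import numpy as np

def synthetic_correlation(E, rng, low=0.05, high=0.3):
    """Sketch of the data-generation steps: random magnitudes with random
    signs on the edges of the chordal pattern E, then a diagonal elevation
    (diagonal dominance) and renormalization to unit diagonal."""
    n = E.shape[0]
    mag = rng.uniform(low, high, size=(n, n))
    sgn = rng.choice([-1.0, 1.0], size=(n, n))
    A = np.triu(mag * sgn * (E != 0), k=1)
    A = A + A.T
    # Elevate the diagonal so the matrix is positive definite ...
    np.fill_diagonal(A, np.abs(A).sum(axis=1) + 1.0)
    # ... then rescale back to a correlation matrix (unit diagonal).
    d = 1.0 / np.sqrt(np.diag(A))
    return d[:, None] * A * d[None, :]

rng = np.random.default_rng(0)
E = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
Sigma = synthetic_correlation(E, rng)
```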
We consider different test cases corresponding to various real-world problems in materials science, power networks, circuit simulation, optimal control, fluid dynamics, social networks, and chemical process simulation. The size of the variable matrices in the GL problem for the investigated problems ranges from matrices with approximately 700 thousand variable elements to matrices with approximately 447 million variable elements. We compare the running time and objective value of our proposed method with those of two other state-of-the-art algorithms, namely the GLASSO [8] and QUIC [23] algorithms (downloaded from http://statweb.stanford.edu/tibs/glasso/ and http://bigdata.ices.utexas.edu/software/1035/, respectively). The GLASSO algorithm is the most widely used algorithm for the GL, while the QUIC algorithm is commonly regarded as the fastest solver for the GL. Define the relative optimality gap as the normalized difference between the objective value of the proposed solution and that of the solution obtained by each of the other two methods. We consider a 2-hour time limit for the solvers.
Table I shows the results of our simulations. It can be observed that the proposed recursive method significantly outperforms the GLASSO and QUIC algorithms in terms of running time, while achieving a negligible relative optimality gap in most of the test cases. More specifically, for the first 8 cases, the proposed recursive method is substantially faster than QUIC. For the largest test case, QUIC does not find the optimal solution within the 2-hour time limit, while the proposed recursive formula obtains the optimal solution in less than 2 minutes. Furthermore, the recursive method is 726 times faster than the GLASSO algorithm for the first 6 test cases; GLASSO does not find the optimal solution for the 3 largest test cases within the 2-hour time limit.
V Conclusion
In many graphical learning problems, the goal is to obtain a sparsity graph that describes the conditional independence between different elements of the available dataset via sparse inverse covariance estimation. The Graphical Lasso (GL) is one of the most commonly used methods for addressing this problem. It is known that, in high-dimensional settings, the GL is computationally prohibitive. A cheap alternative for finding the sparsity pattern of the inverse covariance matrix is a simple thresholding method performed on the sample covariance matrix of the data. Recently, we have provided sufficient conditions under which the thresholding is equivalent to the GL in terms of the sparsity pattern of the graphical model. Based on this result, we have shown that the GL has a closed-form solution when the thresholded sample covariance matrix is acyclic. In this paper, this result is generalized to problems where the thresholding results in a chordal structure. It is shown that the sufficient conditions for the equivalence of the thresholding and the GL can be significantly simplified for chordal structures and are expected to hold as the dimension of the data increases. Furthermore, it is shown that the GL reduces to a maximum determinant matrix completion problem when the thresholding is equivalent to the GL and, for chordal structures, the corresponding matrix completion problem admits a simple recursive formula. The performance of the derived recursive formula is compared with that of other commonly used methods, and it is shown that, for large-scale GL problems, the proposed method significantly outperforms the others in terms of running time.
[1] J. Garcke, M. Griebel, and M. Thess, “Data mining with sparse grids,” Computing, vol. 67, no. 3, pp. 225–253, 2001.
[2] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
[3] S. Sojoudi and J. Doyle, “Study of the brain functional network using synthetic data,” 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 350–357, 2014.
[4] M. Fardad, F. Lin, and M. R. Jovanović, “Sparsity-promoting optimal control for a class of distributed systems,” American Control Conference, pp. 2050–2055, 2011.
[5] S. Fattahi and J. Lavaei, “On the convexity of optimal decentralized control problem and sparsity path,” American Control Conference, 2017.
[6] S. Fattahi and S. Sojoudi, “Graphical lasso and thresholding: Equivalence and closed-form solutions,” https://arxiv.org/abs/1708.09479, 2017.
[7] E. Candes and J. Romberg, “Sparsity and incoherence in compressive sampling,” Inverse Problems, vol. 23, no. 3, pp. 969–985, 2007.
[8] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[9] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008.
[10] S. Sojoudi, “Equivalence of graphical lasso and thresholding for sparse graphs,” Journal of Machine Learning Research, vol. 17, no. 115, pp. 1–21, 2016.
[11] R. Mazumder and T. Hastie, “Exact covariance thresholding into connected components for large-scale graphical lasso,” Journal of Machine Learning Research, vol. 13, pp. 781–794, 2012.
[12] D. M. Witten, J. H. Friedman, and N. Simon, “New insights and faster computations for the graphical lasso,” Journal of Computational and Graphical Statistics, vol. 20, no. 4, pp. 892–900, 2011.
[13] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu, “High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence,” Electronic Journal of Statistics, vol. 5, pp. 935–980, 2011.
[14] J. W. Liu, “The role of elimination trees in sparse factorization,” SIAM Journal on Matrix Analysis and Applications, vol. 11, no. 1, pp. 134–172, 1990.
[15] D. Fulkerson and O. Gross, “Incidence matrices and interval graphs,” Pacific Journal of Mathematics, vol. 15, no. 3, pp. 835–855, 1965.
[16] M. S. Andersen, J. Dahl, and L. Vandenberghe, “Logarithmic barriers for sparse matrix cones,” Optimization Methods and Software, vol. 28, no. 3, pp. 396–423, 2013.
[17] L. Vandenberghe and M. S. Andersen, “Chordal graphs and semidefinite optimization,” Foundations and Trends in Optimization, vol. 1, no. 4, pp. 241–433, 2015.
[18] J. Dahl, L. Vandenberghe, and V. Roychowdhury, “Covariance selection for nonchordal graphs via chordal embedding,” Optimization Methods and Software, vol. 23, no. 4, pp. 501–520, 2008.
[19] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011.
[20] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, 2011.
[21] C. Coffrin, D. Gordon, and P. Scott, “NESTA, the NICTA energy system test case archive,” arXiv preprint arXiv:1411.0359, 2014.
[22] A. Agrawal, P. Klein, and R. Ravi, “Cutting down on fill using nested dissection: Provably good elimination orderings,” Graph Theory and Sparse Matrix Computation, Springer, pp. 31–55, 1993.
[23] C. J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, “QUIC: Quadratic approximation for sparse inverse covariance estimation,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 2911–2947, 2014.
[24] M. Fukuda, M. Kojima, K. Murota, and K. Nakata, “Exploiting sparsity in semidefinite programming via matrix completion I: General framework,” SIAM Journal on Optimization, vol. 11, no. 3, pp. 647–674, 2001.