How can we learn sparse graphs with enhanced interpretability under the Gaussian Markov random field (GMRF) [GMRF]? This is the central question addressed in this article. A graph, containing a set of vertices and edges, is a mathematical tool to represent the dependencies among components (such as nodes of a network or pixels of an image), through the selection of pairwise relations (edge weights) between each pair of objects (vertices). In particular, the strength of the relation can be expressed in terms of (nonnegative) graph weights. In the present context, graph “sparseness” is an important property because it tends to provide better interpretability, i.e., relative to all possible connections between nodes, only a few edges are non-zero and provide information about the major relationships between objects.
The problem of learning graphs from data has been studied widely in a variety of fields including signal processing, machine learning, and statistics[friedman2008sparse, mazumder2012, Meinshausen_06, JMLR:v9:banerjee08a, GSP_GL_2, wang2012, Sun12, Lake, ortega16, palomar20, lam2009]. Graph learning has been considered in multiple applications such as design of functional brain network architectures [VECCHIO2017206], molecular biology [Mason2007]
, and network anomaly detection[Bhuyan16]. We also refer the reader to [connect, Dong_servey] for comprehensive reviews of graph learning. The graphical model approach [friedman2008sparse, mazumder2012, Meinshausen_06, JMLR:v9:banerjee08a, GSP_GL_2] represents dependencies with the data in graph form and has gained significant popularity owing to two main reasons. First, the graphical model is built upon a solid statistical foundation, so that the edge weights have a physical meaning under certain assumptions. For instance, if the observed data are derived from a GMRF model, the weights are based on partial correlation coefficients [book_Probabilistic_graphical_model]. Second, it provides excellent versatility as it assumes no specific structure on the graph. A particular example of graph learning algorithm is the graphical lasso [friedman2008sparse, Bhuyan14, wang2012, Sun12, Lake], which employs regularization on the edge weights to obtain the sparse inverse covariance matrix of a GMRF model [GMRF]. This approach has been extended and modified in [ortega16] to learn several types of Laplacian matrices, including a formulation where the inverse covariance matrix has a combinatorial graph Laplacian (CGL) structure [ortega16]. As noted earlier, research on sparse graph learning is motivated by the fact that sparsity enhances the interpretability of the learned graphs [Buhlmann_book, ortega16]. All those sparsity-seeking methods exploit convex penalties (such as the norm) mainly due to their mathematical tractability.
To clarify the motivation of the present study, let us turn our attention to sparse linear regression. A plethora of nonconvex alternatives to the
regularization have been proposed to reduce the estimation bias while maintaining the benefit of variance reduction[zhang2010, surveyOfNonconvexPenalty, Mazumder_11, lam2009]. Among them, we focus on the minimax concave (MC) penalty [zhang2010, selesnick] because:
it saturates (i.e., it returns a constant value when the variable being estimated exceeds a given threshold) thereby reducing the estimation biases significantly;
it has been shown to bridge the gap between the and norms in a parametric way [Abe_2020];
it is a weakly convex function [Nurminskii_73_weakly_convex]; more specifically, it is given by subtracting from the norm its Moreau envelope.
Property (iii) is of particular importance from an optimization viewpoint because the overall convexity of the cost function is ensured when the MC penalty is used with strongly convex loss functions, and also because the decomposed form in terms of a convex function and its Moreau envelope is compatible with the efficient operator splitting methods. The MC penalty has been used in various sparse estimation problems, e.g., feature selection with a sparse support vector machine (SVM)[ex_MCP_1_Feature_Selection] and gear fault diagnosis from noisy vibration signals [ex_MCP_2_Bearing_Fault_Diagnosis]. To the best of authors’ knowledge, the quasi-norm for is the only function, excluding the MC penalty, that is known to possess property (ii) above, but it lacks properties (i) and (iii). On the other hand, the smoothly clipped absolute deviation (SCAD) penalty [lam2009] is similar to the MC penalty and it could be an alternative choice, since it actually possesses property (i) as well as the weak-convexity part of property (iii) although further investigations would be needed to determine whether property (ii) and the other part of property (iii) also hold for the SCAD penalty. While nonconvex alternatives to the penalty have been successful in the context of sparse linear regression, their study for graph learning has been limited, and most of the existing graph learning methods use the or regularization. A few exceptions include approaches using the log function [palomar20] or the norm [Shen12]. A use of the SCAD penalty was also mentioned in [lam2009, palomar_MC]. The recent works [palomar_MC, zhang2020learning] have observed that an increase of the regularization parameter of the penalty in the CGL estimation framework ultimately does not lead to a sparse solution and instead produces a dense solution associated with a fully connected graph. Based on this observation, in [palomar_MC, zhang2020learning], it has been shown that the use of the MC penalty (as well as other nonconvex penalties) yields better performance. However, these approaches are based on a nonconvex formulation and thus there is no guarantee that the generated sequence of graphs converges to a global optimum. This motivates us to devise another formulation which benefits from the weak convexity of the MC penalty to guarantee overall convexity of the entire cost so that the generated graphs converge to provably global optimum.
The goal of this article is to present a novel graph learning framework based on a convex formulation involving the nonconvex (but weakly convex) MC penalty to produce sparse graphs, and specifically sparse CGL matrices. Since CGLs are symmetric matrices, we remove this redundancy by representing a CGL matrix using a linear transform of the vector of graph weights corresponding to the upper-triangular part.111Although the one-to-one linear operator for representing the CGL was used in the structured graph learning via Laplacian spectral constraints (SGL) method [palomar20], the proposed method is more efficient (as shown in Section IV) due to the proposed reformulation, which allows to use the primal-dual splitting method [primal-dual] (as shown in Section III). Here, the upper-triangular part represents the undirected relations among nodes and completely characterizes the CGL matrix, so that our estimate is automatically guaranteed to have a Laplacian structure without the need to impose any constraints. This is in sharp contrast to the existing CGL approaches, which typically require both a positive semi-definite constraint and a linear constraint. Our formulation involves the nonconvex MC penalty, instead of the norm, while essentially keeping the same terms (the “nonsmooth” log-determinant term and the linear term) as the graphical lasso formulation but with the linear operator mentioned above. Note here that the negative log-determinant function is differentiable but with non-Lipschitz-continuous gradient. Due to the nonconvexity and the nonsmoothness, the problem cannot be solved directly using existing optimization methods. To circumvent the difficulty, we invoke the classical Moreau’s decomposition and show that the Tikhonov regularization convexifies the overall cost function, reformulating the problem into a canonical form of the primal-dual splitting method [primal-dual]. We present the admissible conditions under which the convergence to the global optimal point is guaranteed by the primal-dual splitting method. Numerical examples show that the proposed method outperforms the conventional CGL method (its -based counterpart) for three types of graph. Compared to the state-of-the-art method, the structured graph learning via Laplacian spectral constraints (SGL) [palomar20], the proposed method achieves comparable or better performance, depending on the type of graph, with up to 40 times shorter CPU time. In addition, experiments with real data show that the method produces a sparser graph than other existing methods.
New features of the present work relative to our preliminary work [koyakumaru_21] include detailed proofs of the mathematical results and refined experimental results as well as additional simulation results using real data.
We present notation, and then show some mathematical tools used in this work. We finally present the primal-dual splitting method which is used to solve the proposed optimization problem to be presented in Section III.
The sets of real numbers and nonnegative real numbers are denoted by and , respectively. The transpose of vector/matrix is denoted by . Given a vector , define the and the norms by := and := , respectively. Similarly, given a matrix with its component denoted by , define the and the Frobenius norms by and , respectively. Given a pair of matrices and , define the inner product . Let and
denote the identity matrix and the vector of ones, respectively, and letrepresent the diagonal matrix consisting of the components of a vector .
We consider undirected weighted graphs with nonnegative edge weights. The graph is composed of a set of nodes , edges , and a symmetric weight matrix with if , and if , where is the number of nodes. Here, for any by convention. CGL is defined by , where
is the degree matrix. CGL has zero row-sums with its minimum eigenvalue also zero which is simple when the graph is connected.
Ii-B Mathematical tools
The conjugate of a function is denoted by , . The set of proper lower semicontinuous convex functions from to (] is denoted by .222 A function is proper if , and lower semicontinuous at if . The proximity operator of of index is defined as follows [combettes_book]:
Uniqueness and existence of the minimizer is guaranteed by the strong convexity and coercivity of . The indicator function with respect to a given set is denoted by
It is clear by definition that . The Moreau envelope of a function of index is defined as follows [combettes_book, Definition 12.20]:
Using the Moreau envelope of , which is the widely known Huber function, the MC penalty [selesnick] is defined as
The nonconvex function here is known to induce a sparser and less biased estimate with respect to the penalty.
Ii-C Primal-dual splitting method
Let and be real Hilbert spaces: in the present case, and . The primal-dual splitting method [primal-dual] solves convex optimization problems in the following form:
where is a differentiable convex function with Lipschitz continuous gradient , and are proximable proper lower semicontinuous convex functions, and is a bounded linear operator with its adjoint operator denoted by . Here, “proximable” means that the proximity operator of the function can be computed easily (in a closed form in the present case). The primal dual splitting method is given in Algorithm 1.
Iii Proposed Algorithms
Due to its structure (i.e., symmetry and zero row-sums), the CGL is completely defined by its upper (or lower) triangular part excluding the main diagonal, or, in other words, by a length- vector, where the CGL is of size . Since all the off-diagonal components of CGL need to be nonnegative, our variable vector is constrained to the nonnegative cone (the nonnegative orthant) . Given this, we define a specific linear operator that maps a nonnegative vector of size to its corresponding CGL. For , for instance, is defined as follows:
Iii-a Problem formulation
The CGL formulation presented in [ortega16] is a popular extension of graphical lasso for imposing a Laplacian constraint. With a slight modification using the linear operator introduced above, the CGL formulation is given by
where , stands for the sample covariance obtained from data, and is the regularization parameter. Note here that is positive definite if and only if the graph of is connected, i.e., is an irreducible matrix, where denotes the Hadamard product.
Replacing the regularization term of Problem by the MC penalty given in (4), the problem reads as follows:333Although the formulation in P has been considered in the literature [palomar_MC, zhang2020learning], it was also considered earlier in the authors’ previous works [koyakumaru_bthesis, koyakumaru_21] as an intermediate step, and the present study is independent from [palomar_MC, zhang2020learning].
We introduce the Tikhonov regularization term , , which also plays a role of convexification as shown below. Introducing the indicator function to accommodate the constraint as well, Problem is transformed into the following unconstrained optimization problem:
By using Moreau’s decomposition [combettes_book, Theorem 14.3], the MC penalty term can be rewritten as
where is the ball of radius- [combettes_book, Example 13.32]. The usefulness of this decomposition of the MC penalty term has been observed also in [suzuki21, kaneko20, yukawa21_alime, komuro_2021_SSP]. Using (10), the function of P can be rewritten as
of which the convexity is ensured clearly by choosing and such that (see Proposition 4 below). The Tikhonov regularization term thus has a convexification property, as mentioned above. On the other hand, the functions and are convex, since the composition of a convex function with an arbitrary affine operator is also a convex function. Hence, under the convexity condition given above, Problem takes the form of (5), and it can be solved by the primal-dual splitting method.
Iii-B Optimization algorithm
The proposed algorithm is derived by applying the primal-dual spitting method to Problem .
Iii-B1 Derivation of
We define the soft thresholding operator for a length- positive vector by
where is the th component of the argument. The convex projection onto the nonnegative cone is given by
Applying Step 1 of Algorithm 1 to Problem yields
The operators and can be computed by using the following propositions.
Let be an arbitrary CGL matrix with its component denoted by . Then, for any such that , it holds that
Proof: See Appendix A.
The proximity operator of of index can be expressed by
Proof: See Appendix B.
Iii-B2 Derivation of
Substituting Moreau’s decomposition [combettes_book, Theorem 14.3]
with into Step 2 of Algorithm 1 yields
Here, the proximity operator can be written in a closed form, as shown in the following proposition.
An application of Proposition 3 to (LABEL:y_np1_op) yields
where is the eigenvalue
of , and with the eigenvectors of .
The proposed algorithm is given in Algorithm 2. Our formulation based on the MC penalty is expected to yield a sparser solution with better interpretability than the conventional -based methods due to the efficient sparsity promoting property of the MC penalty. In addition, our representation of CGL using the linear operator reduces the number of variables approximately by half, while also transforming the positive semi-definite constraint of the graph Laplacian to the nonnegativity constraint . However, our formulation needs complexity due to the need to execute matrix multiplication and eigenvalue decomposition. This computational drawback can be mitigated by using the eigenvalue decomposition method for symmetric matrices [MagoarouGT16]. Note that, in the particular case of , the proposed algorithm gives an alternative way to solve the graphical lasso problem for CGL.
Iii-C Convergence conditions
Convergence is guaranteed under the following conditions.
If , the function is convex. In this case, Algorithm 1 converges to a global minimizer of if the following conditions are jointly satisfied:
Proof: See Appendix D.
From a theoretical side, a use of satisfying the convexity condition shown in Proposition 4 ensures convergence to a global minimizer. From a practical side, on the other hand, a use of violating the convexity condition may yield better performance, as will be seen in Section IV. However, we emphasize that improved performance comes with no theoretical guarantees. A remarkable advantage of the present framework is its flexibility due to the use of powerful convex analytic solver, which allows to extend the presented framework in many possible directions including dynamic graph learning [HallacPBL17, tanaka19_time_varying].
Iv Numerical Examples
We show the efficacy of the proposed method through some experiments with synthetic and real data. We first show the performances of the proposed method for different regularization parameters. We then compare the performance of the proposed method with CGL estimation [ortega16] and SGL estimation [palomar20]. 444We used the implementations of the SGL and CGL algorithms in spectralGraphTopology (https://CRAN.R-project.org/package=spectralGraphTopology).
Iv-a Experiments with synthetic data
Dataset generation: We consider three types of graph: (i) grid graph with nodes connected to their four nearest neighbors (except the nodes at boundaries), (ii) random modular graph (a.k.a. stochastic block model)
with four modules where the nodes are connected across the modules and within each module with probabilities 0.01 and 0.3, respectively, and (iii) Erdös-Rényi graph
with nodes connected to other nodes with probability 0.1. The graph weights are randomly drawn from the uniform distribution over the interval [0.1, 3.0], regarded as the ground-truth graph Laplacianin this experiment. From each graph generated, data are generated from , where denotes the Moore-Penrose pseudo inverse, and the covariance matrix is computed from data. For each type of graph, we randomly generate graphs with nodes using the toolbox given in [perraudin2014gspbox].
The relative error (RE) and F-score (FS) are used as performance measures:
where , and stand for true-positive, false-positive, and false-negative, respectively. Here, RE indicates the discrepancy between the ground-truth graph Laplacian and its estimate , while FS is a measure of accuracy of binary classification (taking values in [0,1]), indicating whether the sparse structures are extracted correctly.
Iv-A1 Performance of the proposed method
We study the impacts of the parameters and of the MC penalty on the performance of the proposed method. We also tested the case of , which makes the entire cost function nonconvex for any , with the other parameters tuned manually (see Table I). To study the impact of , we fix which gave a best performance in the nonconvex case. Since is the algorithm parameter and it only affects the convergence speed, we fix it to . The other parameters are then set to and according to the convexity condition (See Proposition 4). Figure 1 plots the RE and FS curves across in modular graph for different choices of . To study the impact of , on the other hand, we fix and choose the other parameters in the same way as in Fig. 1. Figure 2 plots the RE and FS curves for different choices of under the same conditions as in Fig. 1, with =. In Figs. 1 and 2, the proposed method using the convexity condition attains better performance than the nonconvex case when is small, while the nonconvex case is better when is large. To be more specific, when is small, using larger , or larger , yields better performance. Although the regularization parameter for the Tikhonov regularization needs to be sufficiently large to ensure the convexity of the entire objective, using a that is too large tends to yield a less sparse solution, which means degradation of graph interpretability (cf. Section III-C).
Iv-A2 Comparisons in estimation accuracy
Parameters: The best parameters are chosen manually, see Table I. The generated graphs depend only on , , and , while being independent of the algorithm parameters , , and in principle. It is very important to tune the parameters , , and carefully for better performance, although the generated graph changes gradually as each of those parameters changes. As shown in Section IV-A1, for the proposed method under the convexity condition, is used. We mention that a smaller value of tends to give a better result for RE and FS when is large. Although the convergence is guaranteed under the condition of described in Section III-C, (for which there is no guarantee of convergence to the global minimizer) gave the best performance in the current experiments. Regarding the parameters for CGL and SGL, we follow the parameter selection techniques proposed in [ortega16] and [palomar20], respectively.
Results: Figure 3 shows the ground truth and the learned graph for where is the number of measurements. One can see that the proposed method yields a more accurate graph than the other methods; in particular, the graph obtained by the proposed method is remarkably sparse. Figures 4 – 6 show the performances in RE and FS across for the grid graph , the random modular graph , and the Erdös-Rényi graph , respectively. In Fig. 4, the proposed method (nonconvex) significantly outperforms CGL due to the use of the MC penalty, while the proposed method under the convexity condition achieves approximately the same RE performance as the nonconvex case with the degraded FS performance for large . The performance of SGL is close to that of the proposed method in this case. The proposed method under the convexity condition exhibits approximately the same performance as the proposed method for (the nonconvex case). In Fig. 5, on the other hand, the proposed algorithm outperforms SGL considerably. The proposed method significantly outperforms CGL in FS for large , although those two methods exhibit comparable performances in RE (and in FS as well for small ). In Fig. 6, the proposed method outperforms the other methods in FS for a wide range of values, while the proposed method under the convexity condition achieves approximately the same RE performance as the nonconvex case with its FS performance close to those of CGL and SGL for large . We remark that the difference between the proposed method and CGL is more notable in Fig. 4 than in Figs. 5 and 6 because the graph in Fig. 4 is approximately four times sparser than that of Fig. 5 and twice sparser than that of Fig. 6. Thus, the sparsity assumption is a better match to the actual data in for the graph from Fig. 4. We finally remark that, for CGL and SGL, a small regularization parameter was used because use of large regularization parameters with the norm leads to increased errors, which degrade the quality (e.g., interpretability) of the learned graphs. This is the reason why the graphs obtained by CGL and SGL in Fig. 3 are not sufficiently sparse.
Iv-A3 Comparisons in computation time
We investigate how the computation time with tolerance error changes with the size of the graph for the modular graph We set and perform graph learning 15 times as in Section IV-A. The computation time for is summarized in Table II. It can be seen that the proposed method is 5.38 – 43.4 times faster than SGL. Although the computation time of the proposed method is higher due mainly to the eigenvalue decomposition of the larger sized matrix compared to CGL, we emphasize that the performance improvements are remarkable especially for the grid graph (See Fig. 4). The significant advantage in CPU time is due to the fact that the proposed method requires a few thousand iterations for approximate convergence, while SGL requires over iterations on average with per-iteration complexity of order .
Iv-B Experiments with real data
We test our method for the animal dataset [Kemp10687], in which each node represents each animal and the edges represent how much the animals are related to each other. The dataset contains binary values (i.e., it is a categorical dataset) which represent the answers to some questions such as “has lungs?” for instance. There are 102 such questions in total, answered for 33 different animals. The covariance matrix is created based on this data set, and the graph is learned in the same way as in the previous experiment. The results are shown in Fig. 7. It can be seen that the proposed method produces a sparser graph than CGL and SGL while preserving the dominant links.
We presented a graph learning method inserting the nonconvex MC penalty into the extension of the graphical lasso formulation to produce sparse and accurate graphs. With the linear-operator-based representation of CGL together with the reformulation through the Moreau decomposition, an efficient algorithm was derived with the primal-dual splitting method, for which an admissible choice of parameters was presented to ensure the provable convergence. Numerical examples showed that the proposed method significantly outperformed CGL for the high-sparsity grid graphs, while still achieving higher F-scores than CGL and SGL (the state-of-the-art method) for the low-sparsity modular graphs. In addition, the comparisons in CPU time showed that the proposed method was dramatically faster than SGL.
Appendix A Proof of Proposition 1
Let . By definition of adjoint operator, it holds that =, of which the left side can be expanded as . It can therefore be seen that and . In general, it can be verified that .
Appendix B Proof of Proposition 2
The assertion can be verified by combining the basic property [combettes_book, Proposition 24.8]
and the following fact [yukawa, Proposition 1]:
for . Indeed, since , we have . Applying (B.1) to and , we obtain
Appendix C Proof of Proposition 3
The proximity operator of
is given by [combettes_book, Example 24.66]
By (C) and the property [combettes_book, Proposition 24.8]:
we obtain the result.
Appendix D Proof of Proposition 4
It is clear from (11) that is convex when . The convergence condition of the primal dual splitting method is given as follows [primal-dual]:
where is the Lipschitz constant of and is the operator norm. We show below the Lipschitz constant of and the operator norm of . (Although it is shown in [palomar20] that , we show the proof for self-containedness.)
D-a Derivation of the Lipschitz constant of
The gradient of is given by
Hence, by the nonexpansiity of the proximity operator as well as the triangular inequality of norm, we obtain
from which is -Lipschitz contiuous.
D-B Derivation of
By definition of Laplacian, we have