# A Unified Framework for Structured Graph Learning via Spectral Constraints

Graph learning from data represents a canonical problem that has received substantial attention in the literature. However, insufficient work has been done on incorporating prior structural knowledge into the learning of underlying graphical models from data. Learning a graph with a specific structure is essential for interpretability and identification of the relationships among data. Useful structured graphs include the multi-component graph, bipartite graph, connected graph, sparse graph, and regular graph. In general, structured graph learning is an NP-hard combinatorial problem; therefore, designing a general tractable optimization method is extremely challenging. In this paper, we introduce a unified graph learning framework lying at the integration of Gaussian graphical models and spectral graph theory. To impose a particular structure on a graph, we first show how to formulate the combinatorial constraints as an analytical property of the graph matrix. Then we develop an optimization framework that leverages graph learning with specific structures via spectral constraints on graph matrices. The proposed algorithms are provably convergent, computationally efficient, and practically amenable for numerous graph-based tasks. Extensive numerical experiments with both synthetic and real data sets illustrate the effectiveness of the proposed algorithms. The code for all the simulations is made available as an open source repository.


## 1 Introduction

Graphs are fundamental mathematical structures consisting of sets of nodes and weighted edges among them. The weight associated with each edge represents the similarity between the two vertices it connects. Graphical models provide an effective abstraction for expressing dependence relationships among data variables available across numerous applications (see Barabási et al., 2016; Wang et al., 2018; Friedman et al., 2008; Guo et al., 2011; Segarra et al., 2017; Banerjee et al., 2008). The aim of any graphical model is to encode the dependencies among the data in the form of a graph matrix, where non-zero entries of the matrix imply the dependencies among any two variables. Gaussian graphical modeling (GGM) encodes the conditional dependence relationships among a set of variables (Dempster, 1972; Lauritzen, 1996). GGM is a tool of increasing importance in a number of fields, including finance, biology, statistical learning, and computer vision (Friedman et al., 2008). In this framework, an undirected graph is matched to the variables, where each vertex corresponds to one variable, and an edge is present between two vertices if the corresponding random variables are conditionally dependent (Lauritzen, 1996). Putting it more formally, consider a $p$-dimensional vector $x = [x_1, x_2, \ldots, x_p]^\top$; the GGM method aims to learn a graph through the following optimization problem

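$$\underset{\Theta \in \mathcal{S}^p_{++}}{\text{maximize}} \quad \log\det(\Theta) - \operatorname{tr}(\Theta S) - \alpha h(\Theta), \tag{1}$$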

where $\Theta \in \mathbb{R}^{p \times p}$ denotes the desired graph matrix with $p$ the number of nodes in the graph, $\mathcal{S}^p_{++}$ denotes the set of positive definite matrices of size $p \times p$, $S \in \mathbb{R}^{p \times p}$ is a similarity matrix, $h(\cdot)$ is the regularization term, and $\alpha$ is the regularization parameter. When the observed data is distributed according to a zero-mean $p$-variate Gaussian distribution and the similarity matrix is the sample covariance matrix (SCM), the optimization in (1) corresponds to the maximum likelihood estimation (MLE) of the inverse covariance (precision) matrix of the Gaussian random variable, also known as a Gaussian Markov Random Field (GMRF). With the graph $\mathcal{G}$ inferred from $\Theta$, the random vector $x$ follows the Markov property: $\Theta_{ij} \neq 0$ implies $x_i$ and $x_j$ are conditionally dependent given the rest (see Lauritzen, 1996; Dempster, 1972).
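As a concrete illustration of the objective in (1), the following NumPy sketch evaluates it for a given precision matrix. The function name `ggm_objective` and the choice $h(\Theta) = \|\Theta\|_1$ are assumptions made for this example only; the text above leaves $h$ generic.

```python
import numpy as np

def ggm_objective(Theta, S, alpha=0.0):
    """Evaluate log det(Theta) - tr(Theta S) - alpha * h(Theta).

    h is taken to be the element-wise l1 norm (an assumption for this sketch).
    """
    # Cholesky raises LinAlgError if Theta is not positive definite.
    chol = np.linalg.cholesky(Theta)
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return logdet - np.trace(Theta @ S) - alpha * np.abs(Theta).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))    # 500 samples of a 4-dimensional vector
S = np.cov(X, rowvar=False)          # sample covariance matrix
Theta_mle = np.linalg.inv(S)         # unpenalized maximizer (alpha = 0)
Theta_other = Theta_mle + 0.1 * np.eye(4)

print(ggm_objective(Theta_mle, S), ggm_objective(Theta_other, S))
```

With `alpha = 0` the objective is strictly concave over positive definite matrices and is maximized at the inverse of the sample covariance, so the first printed value should be the larger of the two.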

In many real-world applications, prior knowledge about the underlying graph structure is usually available. For example, in gene network analysis, genes can be grouped into pathways, and connections within a pathway might be more likely than connections between pathways, forming a cluster (Marlin and Murphy, 2009). For better interpretability and precise identification of the structure in the data, it is desirable to enforce structures on the learned graph matrix $\Theta$. Furthermore, the structured graph also enables performing more sophisticated tasks such as prediction, community detection, clustering, and causal inference.

It is known that if the ultimate goal is structured graph learning, structure inference and graph weight estimation should be done in a single step (Ambroise et al., 2009; Hao et al., 2018). Performing the structure inference (also known as model selection) prior to the weight estimation (also known as parameter estimation) in the selected model will, in fact, result in a non-robust procedure (Ambroise et al., 2009). Although GGM has been extended to incorporate structures on the learned graph, most of the existing methods perform graph structure learning and graph weight estimation separately. Essentially, the methods are either able to infer connectivity information (Ambroise et al., 2009) or, given known connectivity information, perform the graph weight estimation (see Lee and Liu, 2015; Wang, 2015; Cai et al., 2016; Danaher et al., 2014; Pavez et al., 2018; Egilmez et al., 2017). Furthermore, there are a few recent works considering the two tasks jointly, but those methods are limited to some specific structures (e.g., multi-component in Hao et al., 2018) and cannot be extended to other graph structures. In addition, these methods involve computationally demanding multi-stage steps, which make them unsuitable for big data applications.

In general, structured graph learning is an NP-hard combinatorial problem (Anandkumar et al., 2012; Bogdanov et al., 2008), which brings difficulty in designing a general tractable optimization method. In this paper, we propose to integrate spectral graph theory with GGM graph learning, and convert combinatorial constraints of graph structure into analytical constraints on graph matrix eigenvalues. Realizing the fact that combinatorial structures of a family of graphs (e.g., multi-component graph, bipartite graph, etc.) are encoded in the eigenvalue properties of their graph matrices, we devise a general framework of Structured Graph (SG) learning by enforcing spectral constraints instead of combinatorial structure constraints directly. We develop computationally efficient and theoretically convergent algorithms that can learn graph structures and weights simultaneously.

### 1.1 Related work

The penalized likelihood approach with sparsity regularization has been widely studied in precision matrix estimation. An $\ell_1$-norm regularization ($h(\Theta) = \|\Theta\|_1$), which promotes element-wise sparsity on the graph matrix $\Theta$, is a common choice of regularization function to enforce a sparse structure (Yuan and Lin, 2007; Shojaie and Michailidis, 2010a, b; Ravikumar et al., 2010; Mazumder and Hastie, 2012; Fattahi and Sojoudi, 2019). Friedman et al. (2008) came up with an efficient computational method to solve (1) and proposed the well-known GLasso algorithm. In addition, non-convex penalties have been proposed for sparse precision matrix estimation to reduce estimation bias (Shen et al., 2012; Lam and Fan, 2009). However, if a specific structure is required, then sparse graphical modeling alone is not sufficient, since it only enforces a uniform sparsity structure (Heinävaara et al., 2016; Tarzanagh and Michailidis, 2017). To this end, the sparse GGM model should be extended to incorporate more specific structures.

In this direction, the work in Ambroise et al. (2009) has considered the problem of graph connectivity inference for multi-component structure and developed a two-stage framework lying at the integration of expectation maximization (EM) and the graphical Lasso framework. The works in Lee and Liu (2015); Wang (2015); Cai et al. (2016); Danaher et al. (2014); Guo et al. (2011); Sun et al. (2015); Tan et al. (2015) have considered the problem of edge-weight estimation with known connectivity information. However, prior knowledge of connectivity information is not always available, in particular for massive data with complex and unknown population structures (Hao et al., 2018; Jeziorski and Segal, 2015). Furthermore, considering simultaneous connectivity inference and graph weight estimation, two-stage methods based on a Bayesian model (Marlin and Murphy, 2009) and expectation maximization (Hao et al., 2018) were proposed, but these methods are computationally prohibitive and limited to only multi-component graph structures.

Other important graph structures have also been considered, for example: factor models in Meng et al. (2014), scale-free structure in Liu and Ihler (2011), eigenvector centrality prior in Fiori et al. (2012), degree-distribution in Huang and Jebara (2008), overlapping structure with multiple graphical models in Tarzanagh and Michailidis (2017); Mohan et al. (2014), and tree structure in Chow and Liu (1968); Anandkumar et al. (2012). Recently, there has been considerable interest in enforcing the Laplacian structure (see Lake and Tenenbaum, 2010; Slawski and Hein, 2015; Pavez and Ortega, 2016; Kalofolias, 2016; Egilmez et al., 2017; Pavez et al., 2018), but all these methods are limited to learning a graph without specific structural constraints, or just learn Laplacian weights for a graph with known connectivity information.

Due to the complexity posed by the graph learning problem, owing to its combinatorial nature, existing methods are tailored to specific structures which cannot be generalized to other graph structures; require connectivity information for graph weight estimation; often involve multi-stage framework and become computationally prohibitive. Furthermore, there does not exist any GGM framework to learn a graph with useful structures such as bipartite structure, regular structure and multi-component bipartite structure.

### 1.2 Summary of contributions

Enforcing a structure onto a graph is generally an NP-hard combinatorial problem, which is difficult to solve via existing methods. In this paper, we propose a unified framework of structured graph learning. Our contributions are threefold:

First, we introduce new problem formulations that convert the combinatorial constraints into analytical spectral constraints on Laplacian and adjacency matrices, resulting in three main formulations:

• Structured graph learning via Laplacian spectral constraints:
This formulation utilizes the Laplacian matrix spectral properties to learn multi-component graph, regular graph, multi-component regular graph, sparse connected graph, modular graph, grid graph and other specific structured graphs.

• Structured graph learning via adjacency spectral constraints:
This formulation utilizes spectral properties of the adjacency matrix for bipartite graph learning.

• Structured graph learning via Laplacian and adjacency spectral constraints:
Under this formulation we simultaneously utilize spectral properties of Laplacian and adjacency matrices to enforce non-trivial structures including bipartite-regular graph, multi-component bipartite graph, and multi-component bipartite-regular graph structures.

Second, we develop algorithms based on the block majorization-minimization (MM) framework, also known as block successive upper-bound minimization (BSUM), to solve the proposed formulations. The algorithms are theoretically convergent and computationally efficient with worst case complexity $O(p^3)$, which is the same as that of GLasso.

Third, we verify the effectiveness of the proposed algorithms via extensive synthetic and real data sets experiments. We believe that the work carried out in this paper will provide a starting point for structured graph learning based on Gaussian Markov random fields and spectral graph theory, which in turn may have a significant and long-standing impact. The code for all the simulations is made available as open source repository on author’s website.

### 1.3 Outline and Notation

This paper is organized as follows. The generalized problem formulation and related background are provided in Section 2. The detailed algorithm derivations and the associated convergence results are presented in Sections 3, 4, and 5. Then the simulation results with both real and synthetic data sets for the proposed algorithms are provided in Section 6. Finally, Section 7 concludes the paper with a list of plausible extensions.

In terms of notation, lower case (bold) letters denote scalars (vectors) and upper case letters denote matrices, whose sizes are not stated if they are clear from the context. The $(i,j)$-th entry of a matrix $X$ is denoted by $X_{ij}$. $X^\dagger$ and $X^\top$ denote the pseudo inverse and transpose of matrix $X$, respectively. The all-zero and all-one vectors or matrices of all sizes are denoted by $\mathbf{0}$ and $\mathbf{1}$, respectively. $\|X\|_1$, $\|X\|_F$ denote the $\ell_1$-norm and Frobenius norm of $X$, respectively. $\operatorname{gdet}(X)$ is defined as the generalized determinant of a positive semidefinite matrix $X$, i.e., the product of its non-zero eigenvalues. The inner product of two matrices is defined as $\langle X, Y \rangle = \operatorname{tr}(X^\top Y)$. $\operatorname{Diag}(X)$ is a diagonal matrix with diagonal elements of $X$ filling its principal diagonal and $\operatorname{diag}(X)$ is a vector with diagonal elements of $X$ as the vector elements. Operators are defined using calligraphic letters.

## 2 Problem Formulation

A graph is denoted by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, 2, \ldots, p\}$ is the vertex set, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set. If there is an edge between vertices $i$ and $j$ we denote it by $e_{ij}$. We consider a simple undirected graph with positive weights $w_{ij} > 0$, having no self-loops or multiple edges, and therefore its edge set consists of distinct pairs. Graphs are conveniently represented by some matrix (such as Laplacian and adjacency graph matrices), whose nonzero entries correspond to edges in the graph. The choice of a matrix usually depends on modeling assumptions, properties of the desired graph, applications, and theoretical requirements.

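A matrix $\Theta \in \mathbb{R}^{p \times p}$ is called a graph Laplacian matrix if its elements satisfy

$$\mathcal{S}_\Theta = \Big\{\Theta \,\Big|\, \Theta_{ij} = \Theta_{ji} \le 0 \ \text{for}\ i \ne j;\ \Theta_{ii} = -\sum_{j \ne i} \Theta_{ij}\Big\}. \tag{2}$$

The properties of the elements of $\Theta$ in (2) imply that the Laplacian matrix $\Theta$ is: i) diagonally dominant (i.e., $|\Theta_{ii}| \ge \big|\sum_{j \ne i} \Theta_{ij}\big|$); ii) positive semidefinite, implied from the diagonally dominant property (see den Hertog et al., 1993, Proposition 2.2.20); iii) an $M$-matrix, i.e., a positive semidefinite matrix with non-positive off-diagonal elements (Slawski and Hein, 2015); iv) zero row sum and column sum of $\Theta$ (i.e., $\Theta_{ii} + \sum_{j \ne i} \Theta_{ij} = 0$), which means that the vector $\mathbf{1} = [1, 1, \ldots, 1]^\top$ satisfies $\Theta \mathbf{1} = \mathbf{0}$ (Chung, 1997).

We introduce the adjacency matrix $\Theta_A$ as

$$[\Theta_A]_{ij} = \begin{cases} -\Theta_{ij}, & \text{if } i \ne j, \\ 0, & \text{if } i = j. \end{cases} \tag{3}$$

The non-zero entries of the matrix encode edge weights as $-\Theta_{ij}$, and $\Theta_{ij} = 0$ implies no connectivity between vertices $i$ and $j$.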
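The relationship between the Laplacian set in (2) and the adjacency matrix in (3) can be checked numerically. The sketch below is illustrative only: the helper names (`laplacian_from_adjacency`, `adjacency_from_laplacian`, `is_laplacian`) and the small weighted path graph are made up for this example.

```python
import numpy as np

def laplacian_from_adjacency(A):
    """Build Theta = Diag(A 1) - A from a symmetric non-negative adjacency matrix."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def adjacency_from_laplacian(Theta):
    """Recover the adjacency matrix as in (3): negate off-diagonals, zero the diagonal."""
    A = -np.asarray(Theta, dtype=float)
    np.fill_diagonal(A, 0.0)
    return A

def is_laplacian(Theta, tol=1e-10):
    """Check the conditions in (2): symmetry, non-positive off-diagonals, zero row sums."""
    Theta = np.asarray(Theta, dtype=float)
    off_diag = Theta - np.diag(np.diag(Theta))
    return bool(
        np.allclose(Theta, Theta.T, atol=tol)
        and np.all(off_diag <= tol)
        and np.allclose(Theta.sum(axis=1), 0.0, atol=tol)
    )

# Weighted path graph on 4 nodes: 1 -- 2 -- 3 -- 4
A = np.array([[0.0, 2.0, 0.0, 0.0],
              [2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 3.0],
              [0.0, 0.0, 3.0, 0.0]])
Theta = laplacian_from_adjacency(A)
eigvals = np.linalg.eigvalsh(Theta)

print(is_laplacian(Theta))   # membership in the set (2)
print(eigvals.min())         # non-negative up to rounding (positive semidefinite)
```

The smallest eigenvalue is zero up to rounding, with the all-one vector as the corresponding eigenvector, which matches property iv) above.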

###### Definition 1.

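Let $\Theta$ be a $p \times p$ symmetric positive semidefinite matrix with rank $(p-k) > 0$. Then $x = [x_1, x_2, \ldots, x_p]^\top$ is an improper GMRF (IGMRF) of rank $(p-k)$ with parameters $(\mu, \Theta)$ (assuming $\mu = \mathbf{0}$ without loss of generality), if its density is

$$p(x) = (2\pi)^{-\frac{(p-k)}{2}} \big(\operatorname{gdet}(\Theta)\big)^{\frac{1}{2}} \exp\Big(-\frac{1}{2} x^\top \Theta x\Big), \tag{4}$$

where $\operatorname{gdet}(\cdot)$ denotes the generalized determinant (Rue and Held, 2005) defined as the product of the non-zero eigenvalues of $\Theta$. Furthermore, $x$ is called an IGMRF w.r.t. a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where

$$\Theta_{ij} \ne 0 \iff \{i, j\} \in \mathcal{E} \quad \forall\, i \ne j, \tag{5}$$

$$\Theta_{ij} = 0 \iff x_i \perp x_j \mid x / (x_i, x_j). \tag{6}$$

It simply states that the nonzero pattern of $\Theta$ determines $\mathcal{G}$, so we can read off from $\Theta$ whether $x_i$ and $x_j$ are conditionally independent. If the rank of $\Theta$ is exactly $p$, then $x$ is called a GMRF and the parameters $(\mu, \Theta)$ represent the mean and precision matrix corresponding to a $p$-variate Gaussian distribution (Rue and Held, 2005). In addition, if the precision $\Theta$ has non-positive off-diagonal entries (Slawski and Hein, 2015), then the random vector $x$ is called an attractive improper GMRF.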

### 2.1 A General Framework for Graph Learning under Spectral Constraints

A general scheme is to learn the matrix $\Theta$ as a Laplacian matrix under some eigenvalue constraints, which are motivated from the a priori information for enforcing structure on the learned graph. Now we introduce a general optimization framework for structured graph learning via spectral constraints on the graph matrices,

$$\begin{array}{ll} \underset{\Theta}{\text{maximize}} & \log \operatorname{gdet}(\Theta) - \operatorname{tr}(\Theta S) - \alpha h(\Theta), \\ \text{subject to} & \Theta \in \mathcal{S}_\Theta,\ \lambda(\mathcal{T}(\Theta)) \in \mathcal{S}_{\mathcal{T}}, \end{array} \tag{7}$$

where $S$ denotes the observed data statistics (e.g., the sample covariance matrix), $\Theta$ is the sought graph matrix to be optimized, $\mathcal{S}_\Theta$ is the Laplacian matrix structural constraint set (2), $h(\cdot)$ is a regularization term (e.g., sparsity), and $\lambda(\mathcal{T}(\Theta))$ denotes the eigenvalues of $\mathcal{T}(\Theta)$, which is the transformation of matrix $\Theta$. More specifically, if $\mathcal{T}$ is the identity, then $\mathcal{T}(\Theta) = \Theta$, implying we impose constraints on the eigenvalues of the Laplacian matrix $\Theta$; if $\mathcal{T}(\Theta) = \Theta_A$ as defined in (3), then we enforce constraints on the eigenvalues of the adjacency matrix $\Theta_A$. Finally, $\mathcal{S}_{\mathcal{T}}$ is the set containing spectral constraints on the eigenvalues.

Fundamentally, the formulation in (7) aims to learn a structured graph Laplacian matrix $\Theta$ given data statistics $S$, where $\mathcal{S}_\Theta$ enforces Laplacian matrix structure and $\mathcal{S}_{\mathcal{T}}$ allows including structural constraints of the desired graph structure via spectral constraints on the eigenvalues. Observe that the formulation (7) has converted the complicated combinatorial structural constraints into simple analytical spectral constraints, due to which structured graph learning becomes a matrix optimization problem under the proper choice of spectral constraints.

###### Remark 1.

Apart from the motivation of enforcing structure onto a graph, the Laplacian matrix is also desirable from numerous practical and theoretical considerations: i) the Laplacian matrix is widely used in spectral graph theory, machine learning, graph regularization, graph signal processing, and graph convolution networks (Smola and Kondor, 2003; Defferrard et al., 2016; Egilmez et al., 2017; Chung, 1997); ii) in the high-dimensional setting where the number of data samples is less than the dimension of the data, learning $\Theta$ as an $M$-matrix greatly simplifies the optimization problem by avoiding the need for the explicit regularization term (Slawski and Hein, 2015); iii) the graph Laplacian is crucial for utilizing the GMRF framework, which requires the matrix $\Theta$ to have the positive semi-definite property (Rue and Held, 2005); iv) the graph Laplacian allows flexibility in incorporating useful spectral properties of graph matrices (Chung, 1997; Spielman and Teng, 2011).

###### Remark 2.

From the probabilistic perspective, when the similarity matrix is the sample covariance matrix of Gaussian data, (7) can be viewed as penalized maximum likelihood estimation problem of structured precision matrix of an improper attractive GMRF model, see Definition 1. In a more general setting with arbitrarily distributed data, when the similarity matrix is positive definite matrix, then formulation (7) can be related to the log-determinant Bregman divergence regularized optimization problem (see Dhillon and Tropp, 2007; Duchi et al., 2012; Slawski and Hein, 2015), where the goal is to find the parameters of multivariate Gaussian model that best approximates the data.

In the coming subsections, we will specialize the optimization framework in (7) under Laplacian eigenvalue constraints, adjacency eigenvalue constraints, and joint Laplacian and adjacency eigenvalue constraints.

### 2.2 Structured Graph Learning Via Laplacian Spectral Constraints

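To enforce spectral constraints on the Laplacian matrix $\Theta$ (i.e., $\mathcal{T}(\Theta) = \Theta$ in (7)), we consider the following optimization problem:

$$\begin{array}{ll} \underset{\Theta, \lambda, U}{\text{maximize}} & \log \operatorname{gdet}(\Theta) - \operatorname{tr}(\Theta S) - \alpha h(\Theta), \\ \text{subject to} & \Theta \in \mathcal{S}_\Theta,\ \Theta = U \operatorname{Diag}(\lambda) U^\top,\ \lambda \in \mathcal{S}_\lambda,\ U^\top U = I, \end{array} \tag{8}$$

where $\Theta \in \mathbb{R}^{p \times p}$ is the desired Laplacian matrix and admits the decomposition $\Theta = U \operatorname{Diag}(\lambda) U^\top$, $\operatorname{Diag}(\lambda) \in \mathbb{R}^{p \times p}$ is a diagonal matrix containing $\{\lambda_i\}_{i=1}^p$ on its diagonal with $\lambda \in \mathcal{S}_\lambda$, and $U \in \mathbb{R}^{p \times p}$ is a matrix satisfying $U^\top U = I$. We enforce $\Theta$ to be a Laplacian matrix by the constraint $\Theta \in \mathcal{S}_\Theta$, while we incorporate some specific spectral constraints on $\Theta$ by forcing $\Theta = U \operatorname{Diag}(\lambda) U^\top$, with $\mathcal{S}_\lambda$ containing a priori spectral information on the desired graph structure.

Next, we will introduce various choices of $\mathcal{S}_\lambda$ that will enable (8) to learn numerous popular graph structures.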

#### 2.2.1 k-component graph

A graph is said to be $k$-component connected if its vertex set can be partitioned into $k$ disjoint subsets $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2 \cup \cdots \cup \mathcal{V}_k$ such that any two nodes belonging to different subsets are not connected by an edge. Any edge in the edge set $\mathcal{E}_i \subset \mathcal{E}$ has its end points in $\mathcal{V}_i$, and no edge connects two different components. The $k$-component structural property of a graph is directly encoded in the eigenvalues of its Laplacian matrix. The multiplicity of the zero eigenvalue of a Laplacian matrix gives the number of connected components of a graph $\mathcal{G}$.

###### Theorem 1.

(Chung, 1997) The eigenvalues of any Laplacian matrix can be expressed as:

$$\mathcal{S}_\lambda = \big\{ \{\lambda_j = 0\}_{j=1}^{k},\ c_1 \le \lambda_{k+1} \le \cdots \le \lambda_p \le c_2 \big\}, \tag{9}$$

where $k \ge 1$ denotes the number of connected components in the graph, and $c_1, c_2 > 0$ are some constants that depend on the number of edges and their weights (see Spielman and Teng, 2011).

Figure 1 depicts a $k$-component graph and its Laplacian eigenvalues with $k = 3$ connected components and three zero eigenvalues.
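The statement that the multiplicity of the zero Laplacian eigenvalue equals the number of connected components is easy to verify on a small example. The following sketch assembles a graph from three disjoint complete graphs and counts the zero eigenvalues; the helper names and the tolerance `tol` are choices made for this illustration, not part of the original framework.

```python
import numpy as np

def complete_graph_laplacian(n, weight=1.0):
    """Laplacian of the complete graph on n nodes with a common edge weight."""
    A = weight * (np.ones((n, n)) - np.eye(n))
    return np.diag(A.sum(axis=1)) - A

def block_diagonal(blocks):
    """Place square blocks on the diagonal, i.e., take a disjoint union of graphs."""
    p = sum(b.shape[0] for b in blocks)
    out = np.zeros((p, p))
    start = 0
    for b in blocks:
        n = b.shape[0]
        out[start:start + n, start:start + n] = b
        start += n
    return out

def count_zero_eigenvalues(Theta, tol=1e-8):
    """Number of Laplacian eigenvalues that vanish up to the tolerance."""
    return int(np.sum(np.abs(np.linalg.eigvalsh(Theta)) < tol))

def gdet(Theta, tol=1e-8):
    """Generalized determinant: product of the non-zero eigenvalues."""
    eigvals = np.linalg.eigvalsh(Theta)
    return float(np.prod(eigvals[np.abs(eigvals) > tol]))

# Three components with 3, 4 and 2 nodes
Theta = block_diagonal([complete_graph_laplacian(n) for n in (3, 4, 2)])
k = count_zero_eigenvalues(Theta)
print(k, gdet(Theta))
```

In floating point the zero eigenvalues are only zero up to rounding, so a threshold is needed; how to pick it for noisy, learned matrices is a separate question that this sketch does not address.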

#### 2.2.2 Connected sparse graph

A sparse graph is simply a graph with not many connections among the nodes. Often, making a graph highly sparse can split the graph into several disconnected components, which many times is undesirable (Sundin et al., 2017; Hassan-Moghaddam et al., 2016). The existing formulation cannot ensure both sparsity and connectedness, and there always exists a trade-off between the two properties. Within the formulation (8) we can achieve sparsity and connectedness by using the following spectral constraint:

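$$\mathcal{S}_\lambda = \big\{ \lambda_1 = 0,\ c_1 \le \lambda_2 \le \cdots \le \lambda_p \le c_2 \big\}, \tag{10}$$

with a proper choice of $c_1 > 0$ and $c_2 > 0$.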

#### 2.2.3 d−regular graph

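All the nodes of a $d$-regular graph have the same weighted degree ($d_i = d$, $\forall\, i = 1, 2, \ldots, p$), where the weighted degree is defined as $d_i = \sum_{j : \{i,j\} \in \mathcal{E}} w_{ij}$, which implies:

$$\Theta = dI - \Theta_A, \quad \operatorname{diag}(\Theta) = d\mathbf{1}, \quad \Theta_A \mathbf{1} = d\mathbf{1}.$$

Within the above formulation (8), a $d$-regular structure on the matrix $\Theta$ can be enforced by including the following constraints

$$\mathcal{S}_\lambda = \big\{ \lambda_1 = 0,\ c_1 \le \lambda_2 \le \cdots \le \lambda_p \le c_2 \big\}, \quad \operatorname{diag}(\Theta) = d\mathbf{1}. \tag{11}$$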

#### 2.2.4 k−component d−regular graph

A $k$-component regular graph, also known as a clustered regular graph, is useful in providing improved perceptual grouping (Kim and Choi, 2009) for clustering applications. Within the above formulation (8), we can enforce this structure by including the following constraints

$$\mathcal{S}_\lambda = \big\{ \{\lambda_j = 0\}_{j=1}^{k},\ c_1 \le \lambda_{k+1} \le \cdots \le \lambda_p \le c_2 \big\}, \quad \operatorname{diag}(\Theta) = d\mathbf{1}. \tag{12}$$

#### 2.2.5 Cospectral graphs

In many applications, one is motivated to learn $\Theta$ with specific eigenvalues, which is also known as cospectral graph learning (Godsil and McKay, 1982). One example is spectral sparsification of graphs (see Spielman and Teng, 2011; Loukas and Vandergheynst, 2018), which aims to learn a graph $\Theta$ to approximate a given graph $\bar{\Theta}$, while $\Theta$ is sparse and its eigenvalues $\lambda_i$ satisfy $\lambda_i = f(\bar{\lambda}_i)$, where $\{\bar{\lambda}_i\}_{i=1}^p$ are the eigenvalues of the given graph $\bar{\Theta}$ and $f$ is some specific function. Therefore, for cospectral graph learning, we introduce the following constraint

$$\mathcal{S}_\lambda = \big\{ \lambda_i = f(\bar{\lambda}_i),\ \forall\, i \in [1, p] \big\}. \tag{13}$$

### 2.3 Structured Graph Learning Via Adjacency Spectral Constraints

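To enforce spectral constraints on the adjacency matrix $\Theta_A$ (i.e., $\mathcal{T}(\Theta) = \Theta_A$ in (7)), we introduce the following optimization problem:

$$\begin{array}{ll} \underset{\Theta, \psi, V}{\text{maximize}} & \log \operatorname{gdet}(\Theta) - \operatorname{tr}(\Theta S) - \alpha h(\Theta), \\ \text{subject to} & \Theta \in \mathcal{S}_\Theta,\ \Theta_A = V \operatorname{Diag}(\psi) V^\top,\ \psi \in \mathcal{S}_\psi,\ V^\top V = I, \end{array} \tag{14}$$

where $\Theta$ is the desired Laplacian matrix, and $\Theta_A$ is the corresponding adjacency matrix which admits the decomposition $\Theta_A = V \operatorname{Diag}(\psi) V^\top$ with $\psi \in \mathcal{S}_\psi$ and $V^\top V = I$. We enforce $\Theta$ to be a Laplacian matrix by the constraint $\Theta \in \mathcal{S}_\Theta$, while we incorporate some specific spectral constraints on its adjacency matrix $\Theta_A$ by forcing $\Theta_A = V \operatorname{Diag}(\psi) V^\top$, with $\mathcal{S}_\psi$ containing a priori spectral information of the desired graph structure.

Next, we will introduce various choices of $\mathcal{S}_\psi$ that will enable (14) to learn bipartite graph structures.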

#### 2.3.1 General bipartite graph

A graph is said to be bipartite if its vertex set can be partitioned into two disjoint subsets $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2$ such that no two points belonging to the same subset are connected by an edge (Zha et al., 2001), i.e., for each $(i, j) \in \mathcal{V}_1 \times \mathcal{V}_1$ or $(i, j) \in \mathcal{V}_2 \times \mathcal{V}_2$, $(i, j) \notin \mathcal{E}$. Spectral graph theory states that a graph is bipartite if and only if the spectrum of the associated adjacency matrix is symmetric about the origin (Van Mieghem, 2010, Ch. 5; Mohar, 1997).

###### Theorem 2.

(see Mohar, 1997) A graph is bipartite if and only if the spectrum of the associated adjacency matrix is symmetric about the origin

$$\mathcal{S}_\psi = \big\{ \psi_i = -\psi_{p-i+1},\ \forall\, i = 1, \ldots, p;\ \ \psi_1 \ge \psi_2 \ge \cdots \ge \psi_p \big\}. \tag{15}$$

#### 2.3.2 Connected bipartite graph

The Perron-Frobenius theorem states that if a graph is connected, then the largest eigenvalue of its adjacency matrix has multiplicity 1 (Mohar, 1997). Thus, a connected bipartite graph can be learned by including an additional constraint that the largest and smallest eigenvalues have multiplicity one, i.e., $\psi_1$ and $\psi_p$ are not repeated. Figure 2 shows a connected bipartite graph and its symmetric adjacency eigenvalues.

###### Theorem 3.

(see Mohar, 1997) A graph is a connected bipartite graph if and only if the spectrum of the associated adjacency matrix is symmetric about the origin with non-repeated extreme eigenvalues

$$\mathcal{S}_\psi = \big\{ \psi_i = -\psi_{p-i+1},\ \forall\, i = 1, \ldots, p;\ \ \psi_1 > \psi_2 \ge \cdots \ge \psi_{p-1} > \psi_p \big\}. \tag{16}$$
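The spectral characterization of bipartite graphs can be illustrated numerically. The sketch below builds a weighted bipartite adjacency matrix from a hypothetical biadjacency block `B`, checks that its sorted spectrum is symmetric about the origin, and contrasts it with a triangle, which is not bipartite. All names and the numerical tolerance are assumptions of this example.

```python
import numpy as np

def bipartite_adjacency(B):
    """Adjacency matrix [[0, B], [B^T, 0]] of a bipartite graph with biadjacency block B."""
    B = np.asarray(B, dtype=float)
    r, c = B.shape
    return np.block([[np.zeros((r, r)), B], [B.T, np.zeros((c, c))]])

def sorted_spectrum(A):
    """Adjacency eigenvalues in decreasing order: psi_1 >= ... >= psi_p."""
    return np.sort(np.linalg.eigvalsh(A))[::-1]

def has_symmetric_spectrum(A, tol=1e-8):
    """Check psi_i = -psi_{p-i+1} for all i, up to the tolerance."""
    psi = sorted_spectrum(A)
    return bool(np.allclose(psi, -psi[::-1], atol=tol))

# Connected bipartite graph: 2 nodes on one side, 3 on the other, all pairs linked
B = np.array([[1.0, 2.0, 0.5],
              [0.3, 1.0, 2.0]])
A_bip = bipartite_adjacency(B)

# Triangle: an odd cycle, hence not bipartite
A_tri = np.ones((3, 3)) - np.eye(3)

psi = sorted_spectrum(A_bip)
print(has_symmetric_spectrum(A_bip), has_symmetric_spectrum(A_tri))
print(psi[0] > psi[1], psi[-2] > psi[-1])   # extreme eigenvalues are simple
```

Because the example graph is connected, the largest and smallest eigenvalues are not repeated, in line with the strict inequalities in the connected bipartite constraint.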

### 2.4 Structured Graph Learning Via Joint Laplacian and Adjacency Spectral Constraints

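To enforce spectral constraints on the Laplacian matrix $\Theta$ and the adjacency matrix $\Theta_A$, we introduce the following optimization problem:

$$\begin{array}{ll} \underset{\Theta, \lambda, \psi, U, V}{\text{maximize}} & \log \operatorname{gdet}(\Theta) - \operatorname{tr}(\Theta S) - \alpha h(\Theta), \\ \text{subject to} & \Theta \in \mathcal{S}_\Theta,\ \Theta = U \operatorname{Diag}(\lambda) U^\top,\ \Theta_A = V \operatorname{Diag}(\psi) V^\top, \\ & \lambda \in \mathcal{S}_\lambda,\ U^\top U = I,\ \psi \in \mathcal{S}_\psi,\ V^\top V = I, \end{array} \tag{17}$$

where $\Theta$ is the desired Laplacian matrix which admits the decomposition $\Theta = U \operatorname{Diag}(\lambda) U^\top$ with $\lambda \in \mathcal{S}_\lambda$, $U^\top U = I$, and $\Theta_A$ is the corresponding adjacency matrix which admits the decomposition $\Theta_A = V \operatorname{Diag}(\psi) V^\top$ with $\psi \in \mathcal{S}_\psi$ and $V^\top V = I$. Observe that the above formulation learns a graph Laplacian matrix $\Theta$ with a specific structure by enforcing the spectral constraints on the adjacency and Laplacian matrices simultaneously. Next, we will introduce various choices of $\mathcal{S}_\lambda$ and $\mathcal{S}_\psi$ that will enable (17) to learn non-trivial complex graph structures.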

#### 2.4.1 k−component bipartite graph

A $k$-component bipartite graph, also known as bipartite graph clustering, has significant relevance in many machine learning and financial applications (Zha et al., 2001). Recall that the bipartite structure can be enforced by utilizing the adjacency eigenvalues property (i.e., the constraints in (15)) and the $k$-component structure can be enforced by the Laplacian eigenvalues (i.e., the zero eigenvalues with multiplicity $k$). These two disparate requirements can be simultaneously imposed in the current formulation (17), by choosing:

$$\begin{aligned} \mathcal{S}_\lambda &= \big\{ \{\lambda_j = 0\}_{j=1}^{k},\ c_1 \le \lambda_{k+1} \le \cdots \le \lambda_p \le c_2 \big\}, \\ \mathcal{S}_\psi &= \big\{ \psi_i : \psi_i = -\psi_{p-i+1},\ \forall\, i = 1, \ldots, p \big\}. \end{aligned} \tag{18}$$

#### 2.4.2 k−component regular bipartite graph

The eigenvalue property of a regular graph relates the eigenvalues of its adjacency matrix and Laplacian matrix, which is summarized in the following theorem.

###### Theorem 4.

(Mohar, 1997) Collecting the Laplacian eigenvalues in increasing order ($\{\lambda_i \uparrow\}_{i=1}^p$) and the adjacency eigenvalues in decreasing order ($\{\psi_i \downarrow\}_{i=1}^p$), the eigenvalue pairs for a $d$-regular graph are related as follows:

$$\lambda_i = d - \psi_i, \quad \forall\, i = 1, \ldots, p. \tag{19}$$
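The relation between Laplacian and adjacency eigenvalues of a regular graph can be checked on a weighted cycle, where every node has the same weighted degree. This is a minimal sketch; the helper `weighted_cycle_adjacency` and the chosen size and weight are illustrative assumptions.

```python
import numpy as np

def weighted_cycle_adjacency(p, weight):
    """Adjacency matrix of a cycle on p >= 3 nodes with a common edge weight."""
    A = np.zeros((p, p))
    for i in range(p):
        j = (i + 1) % p
        A[i, j] = A[j, i] = weight
    return A

p, weight = 6, 1.5
A = weighted_cycle_adjacency(p, weight)
d = A.sum(axis=1)[0]                          # common weighted degree, here 2 * weight
Theta = np.diag(A.sum(axis=1)) - A            # Laplacian, equal to d * I - A

lam = np.sort(np.linalg.eigvalsh(Theta))      # Laplacian eigenvalues, increasing
psi = np.sort(np.linalg.eigvalsh(A))[::-1]    # adjacency eigenvalues, decreasing

print(np.allclose(lam, d - psi))
```

Since the Laplacian of a regular graph equals $dI$ minus the adjacency matrix, the two matrices share eigenvectors and the eigenvalue pairing follows directly once both spectra are sorted in opposite orders.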

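A $k$-component regular bipartite structure can be enforced by utilizing the adjacency eigenvalues property (for the bipartite structure), the Laplacian eigenvalues (for the $k$-component structure), along with the joint spectral properties for the regular graph structure:

$$\begin{aligned} \mathcal{S}_\lambda &= \big\{ \{\lambda_j = 0\}_{j=1}^{k},\ c_1 \le \lambda_{k+1} \le \cdots \le \lambda_p \le c_2 \big\}, \\ \mathcal{S}_\psi &= \big\{ \psi_i : \psi_i = d - \lambda_i;\ \psi_i = -\psi_{p-i+1},\ \forall\, i = 1, \ldots, p \big\}. \end{aligned} \tag{20}$$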

### 2.5 Block Successive Upper-bound Minimization algorithm

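The resulting optimization formulations presented in (8), (14), and (17) are still complicated. The aim here is to develop efficient optimization methods with low computational complexity based on the BSUM and majorization-minimization framework (Razaviyayn et al., 2013; Sun et al., 2016). To begin with, we present a general schematic of the BSUM optimization framework

$$\underset{x}{\text{minimize}} \ f(x) \quad \text{subject to} \ x \in \mathcal{X}, \tag{21}$$

where the optimization variable $x$ is partitioned into $m$ blocks as $x = (x_1, x_2, \ldots, x_m)$, with $x_i \in \mathcal{X}_i$, $\mathcal{X} = \prod_{i=1}^{m} \mathcal{X}_i$ is a closed convex set, and $f : \mathcal{X} \to \mathbb{R}$ is a continuous function. At the $t$-th iteration, each block $x_i$ is updated in a cyclic order by solving the following:

$$\underset{x_i}{\text{minimize}} \ g_i(x_i \mid y_i^t) \quad \text{subject to} \ x_i \in \mathcal{X}_i, \tag{22}$$

where $g_i(x_i \mid y_i^t)$, with $y_i^t$ the point formed from the most recent values of all blocks, is a majorization function of $f(x)$ at $y_i^t$ satisfying

$$g_i(x_i \mid y_i^t) \ \text{is continuous in} \ (x_i, y_i^t), \ \forall\, i, \tag{23a}$$

$$g_i(x_i^t \mid y_i^t) = f(x_1^t, \ldots, x_{i-1}^t, x_i^t, x_{i+1}^{t-1}, \ldots, x_m^{t-1}), \tag{23b}$$

$$g_i(x_i \mid y_i^t) \ge f(x_1^t, \ldots, x_{i-1}^t, x_i, x_{i+1}^{t-1}, \ldots, x_m^{t-1}), \ \forall\, x_i \in \mathcal{X}_i, \ \forall\, y_i \in \mathcal{X}, \ \forall\, i, \tag{23c}$$

$$g_i'(x_i; d_i \mid y_i^t)\big|_{x_i = x_i^t} = f'(x_1^t, \ldots, x_{i-1}^t, x_i, x_{i+1}^{t-1}, \ldots, x_m^{t-1}; d), \ \forall\, d = (0, \ldots, d_i, \ldots, 0) \ \text{such that} \ x_i^t + d_i \in \mathcal{X}_i, \ \forall\, i, \tag{23d}$$

where $f'(x; d)$ stands for the directional derivative at $x$ along $d$ (Razaviyayn et al., 2013). In summary, the framework is based on a sequential inexact block coordinate approach, which updates the variable in one block keeping the other blocks fixed. If the surrogate functions are properly chosen, then the solution to (22) could be easier to obtain than solving (21) directly.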
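To make the block-wise update pattern concrete, the sketch below applies the BSUM idea to a toy problem that is not taken from the paper: non-negative least squares, with the variable split into two blocks. Each block is updated by minimizing a quadratic upper bound of the objective, which reduces to a projected gradient step with a block-specific constant. The function name `bsum_nnls`, the block partition, and the random data are all assumptions made for illustration.

```python
import numpy as np

def bsum_nnls(A, b, blocks, num_iters=200):
    """Cyclic block updates for min_{x >= 0} 0.5 * ||A x - b||^2.

    For block i, the surrogate is the quadratic upper bound
    f(y) + grad_i^T (x_i - y_i) + (L_i / 2) ||x_i - y_i||^2,
    where L_i is the largest eigenvalue of A_i^T A_i.
    """
    x = np.zeros(A.shape[1])
    # Squared spectral norm of each column block gives a valid curvature bound.
    lipschitz = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]
    history = []
    for _ in range(num_iters):
        for idx, L in zip(blocks, lipschitz):
            grad = A[:, idx].T @ (A @ x - b)          # block gradient at the current point
            x[idx] = np.maximum(x[idx] - grad / L, 0.0)  # minimizer of the surrogate over x_i >= 0
        history.append(0.5 * np.sum((A @ x - b) ** 2))
    return x, np.array(history)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
blocks = [[0, 1, 2], [3, 4, 5]]

x, history = bsum_nnls(A, b, blocks)
print(history[0], history[-1])
```

Because each surrogate is tight at the current point and upper-bounds the objective, the objective value cannot increase from one sweep to the next; this monotonicity is the property that the SGL updates also rely on.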

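## 3 Structured Graph Learning Via Laplacian Spectral Constraints (SGL)

In this section, we develop a BSUM-based algorithm for Structured Graph learning via Laplacian spectral constraints (SGL). In particular, we consider solving (8) under $k$-component Laplacian spectral constraints (9). To enforce sparsity we use the $\ell_1$ regularization function (i.e., $h(\Theta) = \|\Theta\|_1$). Next, observing that the sign of $\Theta_{ij}$ is fixed by the constraints $\Theta_{ij} \le 0$ for $i \ne j$ and $\Theta_{ii} \ge 0$, the regularization term $\alpha \|\Theta\|_1$ can be written as $\operatorname{tr}(\Theta H)$, where $H = \alpha(2I - \mathbf{1}\mathbf{1}^\top)$, and problem (8) becomes

$$\begin{array}{ll} \underset{\Theta, \lambda, U}{\text{minimize}} & -\log \operatorname{gdet}(\Theta) + \operatorname{tr}(\Theta K), \\ \text{subject to} & \Theta \in \mathcal{S}_\Theta,\ \Theta = U \operatorname{Diag}(\lambda) U^\top,\ \lambda \in \mathcal{S}_\lambda,\ U^\top U = I, \end{array} \tag{24}$$

where $K = S + H$. The resulting problem is complicated and intractable in the current form due to i) the Laplacian structural constraints $\mathcal{S}_\Theta$, ii) the coupling variables $\Theta = U \operatorname{Diag}(\lambda) U^\top$, and iii) the generalized determinant on $\Theta$. In order to derive a more feasible formulation, we first introduce a linear operator $\mathcal{L}$ which transforms the Laplacian structural constraints to simple algebraic constraints and then relax the eigen-decomposition expression into the objective function.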
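For readers who want the intermediate step, the following short derivation sketches why the $\ell_1$ penalty becomes a linear term on the Laplacian set. It assumes $H = \alpha(2I - \mathbf{1}\mathbf{1}^\top)$, a choice that reproduces $\alpha \|\Theta\|_1$ under the sign pattern in (2); this explicit form is an assumption of the sketch and should be checked against the authors' definition of $H$. Since the diagonal entries are non-negative and the off-diagonal entries are non-positive,

$$\alpha \|\Theta\|_1 = \alpha \Big( \sum_{i} \Theta_{ii} - \sum_{i \ne j} \Theta_{ij} \Big) = \alpha \big( 2 \operatorname{tr}(\Theta) - \mathbf{1}^\top \Theta \mathbf{1} \big) = \operatorname{tr}\big( \Theta\, \alpha (2I - \mathbf{1}\mathbf{1}^\top) \big) = \operatorname{tr}(\Theta H).$$

Under this assumption, the data term and the penalty combine as $\operatorname{tr}(\Theta S) + \alpha h(\Theta) = \operatorname{tr}(\Theta (S + H)) = \operatorname{tr}(\Theta K)$, which is the linear term appearing in the minimization form of the problem.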

### 3.1 Graph Laplacian operator L

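The Laplacian matrix $\Theta$ belonging to $\mathcal{S}_\Theta$ satisfies i) $\Theta_{ij} = \Theta_{ji} \le 0$, ii) $\Theta \mathbf{1} = \mathbf{0}$, implying the target matrix is symmetric with degrees of freedom of $\Theta$ equal to $p(p-1)/2$. Therefore, we introduce a linear operator $\mathcal{L}$ that transforms a non-negative vector $w \in \mathbb{R}_+^{p(p-1)/2}$ into the matrix $\mathcal{L}w \in \mathbb{R}^{p \times p}$ that satisfies the Laplacian constraints ($[\mathcal{L}w]_{ij} = [\mathcal{L}w]_{ji} \le 0$ for $i \ne j$ and $[\mathcal{L}w] \mathbf{1} = \mathbf{0}$).

###### Definition 2.

The linear operator $\mathcal{L} : w \in \mathbb{R}_+^{p(p-1)/2} \to \mathcal{L}w \in \mathbb{R}^{p \times p}$ is defined as

$$[\mathcal{L}w]_{ij} = \begin{cases} -w_{i + d_j}, & i > j, \\ [\mathcal{L}w]_{ji}, & i < j, \\ -\sum_{j \ne i} [\mathcal{L}w]_{ij}, & i = j, \end{cases}$$

where $d_j = -j + \frac{j-1}{2}(2p - j)$.

We derive the adjoint operator $\mathcal{L}^*$ of $\mathcal{L}$ by making $\mathcal{L}^*$ satisfy $\langle \mathcal{L}w, Y \rangle = \langle w, \mathcal{L}^* Y \rangle$.

###### Lemma 1.

The adjoint operator $\mathcal{L}^* : Y \in \mathbb{R}^{p \times p} \to \mathcal{L}^* Y \in \mathbb{R}^{p(p-1)/2}$ is defined by

$$[\mathcal{L}^* Y]_k = y_{i,i} - y_{i,j} - y_{j,i} + y_{j,j}, \quad k = i - j + \frac{j-1}{2}(2p - j),$$

where $i, j \in [p]$ satisfy $k = i - j + \frac{j-1}{2}(2p - j)$ and $i > j$.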

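A toy example is given to illustrate the operators $\mathcal{L}$ and $\mathcal{L}^*$ more clearly. Consider a weight vector $w = [w_1, w_2, w_3, w_4, w_5, w_6]^\top$. The Laplacian operator $\mathcal{L}$ on $w$ gives

$$\mathcal{L}w = \begin{bmatrix} \sum_{i=1,2,3} w_i & -w_1 & -w_2 & -w_3 \\ -w_1 & \sum_{i=1,4,5} w_i & -w_4 & -w_5 \\ -w_2 & -w_4 & \sum_{i=2,4,6} w_i & -w_6 \\ -w_3 & -w_5 & -w_6 & \sum_{i=3,5,6} w_i \end{bmatrix}. \tag{25}$$

The operation of $\mathcal{L}^*$ on a $4 \times 4$ symmetric matrix $Y$ returns a vector

$$\mathcal{L}^* Y = \begin{bmatrix} y_{11} - y_{21} - y_{12} + y_{22} \\ y_{11} - y_{31} - y_{13} + y_{33} \\ y_{11} - y_{41} - y_{14} + y_{44} \\ y_{22} - y_{32} - y_{23} + y_{33} \\ y_{22} - y_{42} - y_{24} + y_{44} \\ y_{33} - y_{43} - y_{34} + y_{44} \end{bmatrix}. \tag{26}$$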

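By the definition of $\mathcal{L}$, we have Lemma 2.

###### Lemma 2.

The operator norm $\|\mathcal{L}\|_2$ is $\sqrt{2p}$, where $\|\mathcal{L}\|_2 = \sup_{\|x\| = 1} \|\mathcal{L}x\|_F$ with $x \in \mathbb{R}^{p(p-1)/2}$.

###### Proof.

Follows from the definitions of $\mathcal{L}$ and $\mathcal{L}^*$: see Appendix 8.1 for the detailed proof. ∎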
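The operator, its adjoint, and the operator norm can all be checked numerically on the $p = 4$ toy case. The sketch below is one possible implementation, with function names `lap_op` and `lap_op_adjoint` chosen for this example; it orders the weights column by column over the strictly lower triangle, which reproduces the index map used in the toy example. It then verifies the adjoint identity on a random matrix and computes the operator norm from an explicit matrix representation; for this construction the value comes out as $\sqrt{2p}$.

```python
import numpy as np

def lap_op(w, p):
    """Map w in R^{p(p-1)/2} to the p x p Laplacian matrix Lw."""
    w = np.asarray(w, dtype=float)
    Theta = np.zeros((p, p))
    k = 0
    for j in range(p):              # column index (0-based)
        for i in range(j + 1, p):   # rows strictly below the diagonal
            Theta[i, j] = Theta[j, i] = -w[k]
            k += 1
    # Diagonal entries make every row sum to zero.
    Theta[np.diag_indices(p)] = -Theta.sum(axis=1)
    return Theta

def lap_op_adjoint(Y):
    """Adjoint map: entry k is y_ii - y_ij - y_ji + y_jj for the k-th pair (i, j), i > j."""
    Y = np.asarray(Y, dtype=float)
    p = Y.shape[0]
    out = np.empty(p * (p - 1) // 2)
    k = 0
    for j in range(p):
        for i in range(j + 1, p):
            out[k] = Y[i, i] - Y[i, j] - Y[j, i] + Y[j, j]
            k += 1
    return out

p = 4
w = np.arange(1.0, 7.0)             # w = [1, 2, 3, 4, 5, 6]
Theta = lap_op(w, p)

# Adjoint identity <Lw, Y> = <w, L* Y> on a random matrix Y
rng = np.random.default_rng(0)
Y = rng.standard_normal((p, p))
lhs = np.sum(Theta * Y)
rhs = w @ lap_op_adjoint(Y)

# Matrix representation of L: column k is vec(L e_k); its spectral norm is ||L||_2
m = p * (p - 1) // 2
M = np.column_stack([lap_op(np.eye(m)[k], p).ravel() for k in range(m)])
op_norm = np.linalg.norm(M, 2)

print(Theta)
print(np.isclose(lhs, rhs), op_norm, np.sqrt(2 * p))
```

The explicit loops favor readability over speed; a vectorized index construction would be preferable for large $p$.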

We have introduced the operator $\mathcal{L}$ that helps to transform the complicated structural matrix variable $\Theta$ to a simple vector variable $w$. The linear operator $\mathcal{L}$ is an important component of the SGL framework.

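### 3.2 SGL algorithm

To solve (24), we represent the Laplacian matrix $\Theta \in \mathcal{S}_\Theta$ as $\mathcal{L}w$ and then develop an algorithm based on quadratic methods (Nikolova and Ng, 2005; Ying et al., 2018). We introduce the term $\frac{\beta}{2} \|\mathcal{L}w - U \operatorname{Diag}(\lambda) U^\top\|_F^2$ to keep $\mathcal{L}w$ close to $U \operatorname{Diag}(\lambda) U^\top$ instead of exactly solving the constraint $\mathcal{L}w = U \operatorname{Diag}(\lambda) U^\top$, where $\beta > 0$. Note that this relaxation can be made tight by choosing a sufficiently large $\beta$ or iteratively increasing $\beta$. Now, the original problem can be formulated as

$$\begin{array}{ll} \underset{w, \lambda, U}{\text{minimize}} & -\log \operatorname{gdet}(\operatorname{Diag}(\lambda)) + \operatorname{tr}(K \mathcal{L}w) + \frac{\beta}{2} \|\mathcal{L}w - U \operatorname{Diag}(\lambda) U^\top\|_F^2, \\ \text{subject to} & w \ge 0,\ \lambda \in \mathcal{S}_\lambda,\ U^\top U = I, \end{array} \tag{27}$$

where $w \ge 0$ means each entry of $w$ is non-negative. When solving (27) to learn the $k$-component graph structure with the constraints in (9), the first $k$ zero eigenvalues as well as the corresponding eigenvectors can be dropped from the optimization formulation. Now $\lambda$ only contains the $q = p - k$ non-zero eigenvalues in increasing order $\{\lambda_j\}_{j=k+1}^{p}$, so we can replace the generalized determinant with the determinant on $\operatorname{Diag}(\lambda)$ in (27). $U \in \mathbb{R}^{p \times q}$ contains the eigenvectors corresponding to the non-zero eigenvalues in the same order, and the orthogonality constraint on $U$ becomes $U^\top U = I_q$. The non-zero eigenvalues are ordered and lie in the given set,

$$\mathcal{S}_\lambda = \big\{ c_1 \le \lambda_{k+1} \le \cdots \le \lambda_p \le c_2 \big\}. \tag{28}$$

Collecting the variables in three blocks as $x = (w \in \mathbb{R}^{p(p-1)/2},\ \lambda \in \mathbb{R}^{q},\ U \in \mathbb{R}^{p \times q})$, we develop a BSUM-based algorithm which updates only one variable each time with the other variables fixed.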

#### 3.2.1 Update of w

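Treating $w$ as a variable with $U$ and $\lambda$ fixed, and ignoring the terms independent of $w$, we have the following sub-problem:

$$\underset{w \ge 0}{\text{minimize}} \quad \mathrm{tr}(K\mathcal{L}w) + \frac{\beta}{2}\|\mathcal{L}w - U\mathrm{Diag}(\lambda)U^T\|_F^2. \tag{29}$$

The problem (29) can be written as a non-negative quadratic problem,

$$\underset{w \ge 0}{\text{minimize}} \quad f(w) = \frac{1}{2}\|\mathcal{L}w\|_F^2 - c^Tw, \tag{30}$$

where $c = \mathcal{L}^*\big(U\mathrm{Diag}(\lambda)U^T - \beta^{-1}K\big)$.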
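The step from (29) to (30) can be spelled out as follows. This is a short worked expansion under the definitions above, writing $M = U\mathrm{Diag}(\lambda)U^T$ as shorthand (a symbol introduced here only for brevity) and using the adjoint identity $\langle \mathcal{L}w, Y \rangle = \langle w, \mathcal{L}^*Y \rangle$ together with $\mathrm{tr}(K\mathcal{L}w) = \langle K, \mathcal{L}w \rangle$, which holds because $\mathcal{L}w$ is symmetric.

```latex
\begin{aligned}
\mathrm{tr}(K\mathcal{L}w) + \tfrac{\beta}{2}\|\mathcal{L}w - M\|_F^2
  &= \tfrac{\beta}{2}\|\mathcal{L}w\|_F^2
     - \beta\langle \mathcal{L}w, M \rangle
     + \langle \mathcal{L}w, K \rangle
     + \tfrac{\beta}{2}\|M\|_F^2 \\
  &= \beta\Big( \tfrac{1}{2}\|\mathcal{L}w\|_F^2
     - \big\langle \mathcal{L}w,\, M - \beta^{-1}K \big\rangle \Big)
     + \text{const} \\
  &= \beta\Big( \tfrac{1}{2}\|\mathcal{L}w\|_F^2
     - \big\langle w,\, \mathcal{L}^*(M - \beta^{-1}K) \big\rangle \Big)
     + \text{const}.
\end{aligned}
```

Dropping the constant and the positive factor $\beta$ does not change the minimizer, which gives the form $\frac{1}{2}\|\mathcal{L}w\|_F^2 - c^Tw$ with $c = \mathcal{L}^*(M - \beta^{-1}K)$.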

###### Lemma 3.

The sub-problem (30) is a strictly convex optimization problem.

###### Proof.

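From the definition of the operator $\mathcal{L}$ and the property of its adjoint $\mathcal{L}^*$, we have

$$\|\mathcal{L}w\|_F^2 = \langle \mathcal{L}w, \mathcal{L}w \rangle = \langle w, \mathcal{L}^*\mathcal{L}w \rangle = w^T\mathcal{L}^*\mathcal{L}w > 0, \quad \forall\, w \ne 0. \tag{31}$$

The above result implies that $f(w)$ is a strictly convex function. Together with the fact that the non-negativity set is convex, we conclude that the sub-problem (30) is strictly convex. However, a closed-form solution cannot be derived here because of the non-negativity constraint ($w \ge 0$), and thus we derive a majorization function. ∎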

###### Lemma 4.

The function $f(w)$ in (30) is majorized at $w^t$ by the function

$$g(w \mid w^t) = f(w^t) + (w - w^t)^T\nabla f(w^t) + \frac{L_1}{2}\|w - w^t\|^2, \tag{32}$$

where $w^t$ is the update from the previous iteration and $L_1 = \|\mathcal{L}\|_2^2 = 2p$ (see Lemma 2).

It is easy to check the conditions (23) for the majorization function (see Sun et al., 2016; Song et al., 2015, for more details), so we omit the proof here. Note that the majorization function in (32) is in accordance with the requirement in (23b), because in problem (30) $w$ is the active block and the other coordinates ($\lambda$, $U$) are fixed. For notational brevity, we write the majorization function as $g(w \mid w^t)$ instead of $g(w \mid w^t, U^t, \lambda^t)$.

After ignoring the constant terms in (32), the majorized problem of (30) at $w^t$ is given by

$$\underset{w \ge 0}{\text{minimize}} \quad g(w \mid w^t) = \frac{1}{2}w^Tw - a^Tw, \tag{33}$$

where $a = w^t - \frac{1}{L_1}\nabla f(w^t)$ and $\nabla f(w^t) = \mathcal{L}^*(\mathcal{L}w^t) - c$.

###### Lemma 5.

From the KKT optimality conditions we can easily obtain the optimal solution to (33) as

$$w^{t+1} = \Big( w^t - \frac{1}{L_1}\nabla f(w^t) \Big)^+, \tag{34}$$

where $(a)^+ = \max(a, 0)$, applied elementwise.
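The update (34) is a projected gradient step with step size $1/L_1 = 1/(2p)$. A minimal NumPy sketch is given below; the function names (`make_c`, `f_obj`, `w_update`) are hypothetical and the operators follow the index convention of Definition 2, so this should be read as an illustration rather than the reference implementation.

```python
import numpy as np

def laplacian_op(w, p):
    """Map a weight vector w of length p(p-1)/2 to the p x p matrix Lw."""
    j, i = np.triu_indices(p, k=1)          # pairs with i > j
    Lw = np.zeros((p, p))
    Lw[i, j] = -np.asarray(w, dtype=float)
    Lw[j, i] = Lw[i, j]
    Lw[np.diag_indices(p)] = -Lw.sum(axis=1)
    return Lw

def laplacian_adjoint(Y):
    """Map a p x p matrix Y to the vector L*Y of length p(p-1)/2."""
    j, i = np.triu_indices(Y.shape[0], k=1)
    return Y[i, i] - Y[i, j] - Y[j, i] + Y[j, j]

def make_c(U, lam, K, beta):
    """c = L*(U Diag(lam) U^T - K / beta), the linear term of (30)."""
    return laplacian_adjoint(U @ np.diag(lam) @ U.T - K / beta)

def f_obj(w, c, p):
    """Objective of (30): 0.5 * ||Lw||_F^2 - c^T w."""
    return 0.5 * np.linalg.norm(laplacian_op(w, p), "fro") ** 2 - c @ w

def w_update(w, c, p):
    """One step of (34): gradient step of size 1/(2p), then clip at zero."""
    grad = laplacian_adjoint(laplacian_op(w, p)) - c
    return np.maximum(w - grad / (2.0 * p), 0.0)
```

Because (32) majorizes $f$, each call to `w_update` from a non-negative `w` should not increase `f_obj`; checking this on random inputs is a quick way to catch indexing mistakes in the operators.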

#### 3.2.2 Update of U

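Treating $U$ as a variable block, and fixing $w$ and $\lambda$, we obtain the following sub-problem:

$$\underset{U}{\text{minimize}} \quad \frac{\beta}{2}\|\mathcal{L}w - U\mathrm{Diag}(\lambda)U^T\|_F^2, \quad \text{subject to } U^TU = I_q. \tag{35}$$

Expanding the norm and using $U^TU = I_q$, the only term that depends on $U$ is the cross term, so (35) is equivalent to

$$\underset{U}{\text{maximize}} \quad \mathrm{tr}\big(U^T\mathcal{L}w\,U\,\mathrm{Diag}(\lambda)\big), \quad \text{subject to } U^TU = I_q. \tag{36}$$

The problem (36) is an optimization over the orthogonal Stiefel manifold $\mathrm{St}(p, q) = \{U \in \mathbb{R}^{p \times q} : U^TU = I_q\}$. From (Absil et al., 2009; Benidis et al., 2016), the maximizer of (36) is given by the eigenvectors of $\mathcal{L}w$ (suitably ordered).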

###### Lemma 6.

From the KKT optimality conditions the solution to (36) is given by

$$U = \mathrm{eigenvectors}(\mathcal{L}w)[k+1 : p], \tag{37}$$

that is, the $q = p - k$ principal eigenvectors of the matrix $\mathcal{L}w$, in increasing order of eigenvalue magnitude (Absil et al., 2009; Benidis et al., 2016).
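In NumPy, the update (37) amounts to a symmetric eigen-decomposition followed by dropping the first $k$ columns. The sketch below assumes `Lw` is a symmetric Laplacian matrix; the function name `u_update` is hypothetical.

```python
import numpy as np

def u_update(Lw, k):
    """Return the p x (p - k) matrix of eigenvectors of Lw associated
    with its p - k largest eigenvalues, in increasing eigenvalue order."""
    # eigh returns eigenvalues in ascending order for symmetric input.
    _, eigvecs = np.linalg.eigh(Lw)
    return eigvecs[:, k:]

# Example: a random weighted graph on 6 nodes, dropping k = 2 eigenvectors
rng = np.random.default_rng(0)
A = np.triu(rng.random((6, 6)), k=1)
A = A + A.T                              # symmetric adjacency, zero diagonal
Lw = np.diag(A.sum(axis=1)) - A          # graph Laplacian
U = u_update(Lw, k=2)
print(U.shape, np.allclose(U.T @ U, np.eye(4)))
```

Keeping the eigenvectors in ascending eigenvalue order matters because $\lambda$ is also stored in increasing order, so the $i$-th column of $U$ is paired with $\lambda_i$.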

#### 3.2.3 Update for λ

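We obtain the following sub-problem for the $\lambda$ update:

$$\underset{\lambda \in \mathcal{S}_\lambda}{\text{minimize}} \quad -\log\det \mathrm{Diag}(\lambda) + \frac{\beta}{2}\|\mathcal{L}w - U\mathrm{Diag}(\lambda)U^T\|_F^2. \tag{38}$$

The optimization (38) can be rewritten as

$$\underset{\lambda \in \mathcal{S}_\lambda}{\text{minimize}} \quad -\log\det \mathrm{Diag}(\lambda) + \frac{\beta}{2}\|U^T(\mathcal{L}w)U - \mathrm{Diag}(\lambda)\|_F^2. \tag{39}$$

With a slight abuse of notation and for ease of exposition, we index the non-zero eigenvalues $\lambda_i$ in (28) from $1$ to $q = p - k$ instead of from $k+1$ to $p$. The problem (39) can be further written as

$$\underset{c_1 \le \lambda_1 \le \dots \le \lambda_q \le c_2}{\text{minimize}} \quad -\sum_{i=1}^{q}\log\lambda_i + \frac{\beta}{2}\|\lambda - d\|_2^2, \tag{40}$$

where $\lambda = [\lambda_1, \dots, \lambda_q]^T$ and $d = [d_1, \dots, d_q]^T$, with $d_i$ the $i$-th diagonal element of $\mathrm{Diag}(U^T(\mathcal{L}w)U)$. The sub-problem (40) is a convex optimization problem, and we derive a computationally efficient method to solve it from the KKT optimality conditions. The update rule for $\lambda$ follows an iterative procedure summarized in Algorithm 1. One could solve (40) with a general-purpose convex solver (e.g., CVX), but Algorithm 1 is more efficient for large-scale problems.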
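To give a feel for (40), the sketch below handles only the easy case and is not Algorithm 1. Ignoring the ordering constraint, (40) separates across coordinates, and setting the derivative $-1/\lambda_i + \beta(\lambda_i - d_i)$ to zero gives $\lambda_i = \frac{1}{2}\big(d_i + \sqrt{d_i^2 + 4/\beta}\big)$; clipping to $[c_1, c_2]$ then solves the box-constrained problem. If the clipped vector happens to be non-decreasing, it is feasible for (40) and therefore optimal for it as well. The function name and the returned flag are assumptions made for this illustration.

```python
import numpy as np

def lambda_update_easy_case(d, beta, c1, c2):
    """Coordinate-wise solution of (40) without the ordering constraint.

    Returns (lam, is_ordered). When is_ordered is True, lam also solves
    (40); otherwise the ordering constraint is active and a dedicated
    procedure (such as the paper's Algorithm 1) or a convex solver is needed.
    """
    d = np.asarray(d, dtype=float)
    # Positive root of beta * lam^2 - beta * d * lam - 1 = 0
    lam = 0.5 * (d + np.sqrt(d ** 2 + 4.0 / beta))
    lam = np.clip(lam, c1, c2)           # enforce c1 <= lam_i <= c2
    is_ordered = bool(np.all(np.diff(lam) >= 0))
    return lam, is_ordered

lam, ok = lambda_update_easy_case([0.5, 1.0, 2.0], beta=10.0, c1=1e-5, c2=1e4)
print(lam, ok)
```

Since the map $d_i \mapsto \lambda_i$ is increasing, a sorted `d` always yields a sorted `lam`; the flag only becomes `False` when the entries of `d` are out of order.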

###### Lemma 7.

The iterative-update procedure summarized in Algorithm 1 converges to the KKT point of Problem (40).

###### Proof.

Please refer to the Appendix 8.2 for the detailed proof. ∎

To update the $\lambda_i$'s, Algorithm 1 iteratively checks the situations [cf. steps 6, 10 and 14] and updates the $\lambda_i$'s accordingly until $c_1 \le \lambda_1 \le \dots \le \lambda_q \le c_2$ is satisfied. If one of the situations occurs, the corresponding $\lambda_i$'s are updated accordingly. Note that the situations are independent of each other, i.e., each $\lambda_i$ is never involved in two situations simultaneously. The $\lambda_i$'s are updated iteratively according to the above situations until all of them satisfy the KKT conditions; the maximum number of iterations is $q + 1$.

###### Remark 3.

A problem of the form (40) is popularly known as a regularized isotonic regression problem. Isotonic regression is a well-researched problem that has found applications in numerous domains (see Best and Chakravarti, 1990; Lee et al., 1981; Barlow and Brunk, 1972; Luss and Rosset, 2014; Bartholomew, 2004). To the best of our knowledge, however, there does not exist any computationally efficient method comparable to Algorithm 1. The proposed algorithm can obtain a globally optimal solution within a maximum of $q + 1$ iterations for the $q$-dimensional regularized isotonic regression problem, and can potentially be adapted to solve other isotonic regression problems. The computationally efficient Algorithm 1 is therefore also a contribution to the isotonic regression literature.

#### 3.2.4 SGL algorithm summary

SGL in Algorithm 2 summarizes the implementation of the structured graph learning via Laplacian spectral constraints.

In Algorithm 2, the most computationally demanding step is the eigen-decomposition required for the update of $U$, implying $O(p^3)$ as the worst-case computational complexity of the algorithm. This can be further improved by exploiting the sparse structure and the properties of the symmetric Laplacian matrix in the eigen-decomposition. The most widely used GLasso method (Friedman et al., 2008) has a similar worst-case complexity, although GLasso learns a graph without structural constraints. When specific structural requirements are considered, the SGL algorithm has a considerable advantage over other competing structured graph learning algorithms in Marlin and Murphy (2009); Hao et al. (2018); Ambroise et al. (2009).

###### Theorem 5.

The sequence $(w^t, U^t, \lambda^t)$ generated by Algorithm 2 converges to the set of KKT points of (27).

###### Proof.

The detailed proof is deferred to the Appendix 8.3. ∎

###### Remark 4.

Note that SGL is not limited to $k$-component graph learning, but can be easily adapted to learn other graph structures under the aforementioned spectral constraints in (10), (11), (12), and (13). Furthermore, SGL can also be used to learn popular connected graph structures (e.g., Erdos-Renyi graph, modular graph, grid graph, etc.) even without specific spectral constraints, simply by choosing the eigenvalue constraints corresponding to a one-component graph (i.e., $k = 1$) and setting $c_1$ and $c_2$ to very small and very large values, respectively. Detailed experiments with important graph structures are carried out in the simulation section.

## 4 Structured Graph Learning Via Adjacency Spectral Constraints (SGA)

In this section, we develop a BSUM-based algorithm for Structured Graph learning via Adjacency spectral constraints (SGA). In particular, we solve (14) for the connected bipartite graph structure by introducing the spectral constraints on the adjacency eigenvalues (15). Since $\Theta$ is the Laplacian of a connected graph, the term $\log\mathrm{gdet}(\Theta)$ can be simplified according to the following lemma.

###### Lemma 8.

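If $\Theta$ is a Laplacian matrix for a connected graph, then

$$\mathrm{gdet}(\Theta) = \det(\Theta + J), \tag{41}$$

where $J = \frac{1}{p}\mathbf{1}\mathbf{1}^T$.

###### Proof.

It is easy to establish (41) by the fact that $\Theta\mathbf{1} = \mathbf{0}$. ∎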
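The identity (41) is easy to check numerically. The sketch below builds the Laplacian of a small connected weighted path graph (the graph and its weights are arbitrary choices for illustration), computes the generalized determinant as the product of the non-zero eigenvalues, and compares it with $\det(\Theta + J)$. The helper name `gdet` is hypothetical.

```python
import numpy as np

def gdet(M, tol=1e-10):
    """Generalized determinant: product of the non-zero eigenvalues
    of a symmetric matrix M (eigenvalues below tol count as zero)."""
    eigvals = np.linalg.eigvalsh(M)
    return float(np.prod(eigvals[np.abs(eigvals) > tol]))

# Connected weighted path graph on p = 4 nodes: 1 - 2 - 3 - 4
p = 4
A = np.zeros((p, p))
for i, j, weight in [(0, 1, 0.5), (1, 2, 2.0), (2, 3, 1.5)]:
    A[i, j] = A[j, i] = weight
Theta = np.diag(A.sum(axis=1)) - A       # graph Laplacian
J = np.ones((p, p)) / p                  # J = (1/p) * 1 1^T

print(gdet(Theta), np.linalg.det(Theta + J))
```

The two values agree because adding $J$ leaves the eigenvectors of $\Theta$ unchanged and only moves the single zero eigenvalue, whose eigenvector is proportional to $\mathbf{1}$, to one.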

### 4.1 Graph adjacency operator A

To guarantee the structure of the adjacency matrix, we introduce a linear operator $\mathcal{A}$.

###### Definition 3.

We define a linear operator