 # Estimation of sparse Gaussian graphical models with hidden clustering structure

Estimation of Gaussian graphical models is important in natural science when modeling the statistical relationships between variables in the form of a graph. The sparsity and clustering structure of the concentration matrix is enforced to reduce model complexity and describe inherent regularities. We propose a model to estimate the sparse Gaussian graphical models with hidden clustering structure, which also allows additional linear constraints to be imposed on the concentration matrix. We design an efficient two-phase algorithm for solving the proposed model. We develop a symmetric Gauss-Seidel based alternating direction method of the multipliers (sGS-ADMM) to generate an initial point to warm-start the second phase algorithm, which is a proximal augmented Lagrangian method (pALM), to get a solution with high accuracy. Numerical experiments on both synthetic data and real data demonstrate the good performance of our model, as well as the efficiency and robustness of our proposed algorithm.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Let

be a random vector following a multivariate Gaussian distribution

with an unknown nonsingular covariance matrix . Gaussian graphical models  estimate the concentration matrix from a sample covariance matrix of . It is known that if and only if and are conditionally independent, given all the other variables. The Gaussian graphical model can be represented by an undirected graph , where the vertices contain coordinates and the edges describe the conditional independence relationships among . There is no edge between and if and only if .

To detect nonzero elements in the concentration matrix , researchers have proposed sparse Gaussian graphical models [33, 1]. Given a sample covariance matrix , the sparse Gaussian graphical model attempts to estimate the concentration matrix by solving the following -regularized log-likelihood minimization problem:

 minX⪰0 {⟨C,X⟩−logdet(X)+ρ∑i

where is a given positive parameter, is the standard trace inner product between and , and means that is positive semidefinite. We adopt the convention that . The -norm penalty, which is motivated by the lasso idea , enforces element-wise sparsity on . There are many methods for solving the sparse Gaussian graphical model, such as the well-known GLasso algorithm , the Newton-CG primal proximal point algorithm , and QUIC .

The concentration matrix may have additional structures other than sparsity. For example, Honorio et al. in  enforce the local constancy to find connectivities between two close or distant clusters of variables; Højsgaard and Lauritzen in [11, 12] propose the restricted concentration models where parameters associated with edges or vertices of the same class are restricted to being identical; Duchi et al. in  penalize certain groups of edges together. In all these models, the clusters of the coordinates are assumed to be known. However, in many real applications like the gene expression in cancer data [15, 32], the group/cluster information may be unknown in advance. The authors in  propose a two stage method for learning sparse Gaussian graphical models with unknown block structure. They propose a variational Bayes algorithm to learn the block structure in the first stage, then estimate the concentration matrix by using the block method in the second stage.

Here we aim to estimate the sparse concentration matrix and uncover the hidden clustering structure of the coordinates simultaneously. Note that in the context of a linear regression model where the regression coefficients are expected to be clustered into groups, the clustered lasso regularizer

[3, 21, 25, 28] has been widely used. We borrow the idea of the regularization term to discover the sparsity and unknown clustering structure in the Gaussian graphical models. Thus we modify the sparse Gaussian graphical model (1) as follows:

 (2)

where

are given parameters. In the above model, the penalty on the pairwise differences is to force those entries of the concentration matrix associated with the same cluster of underlying random variables to be the same. In that way, the clustering structure of the random variables can then be discovered. In some more complicated cases, the conditional independence pattern may be partially known. To deal with those cases, one can impose additional constraints on

to get the following model:

 (3)

where is the set of pairs of nodes such that and are known to be conditionally independent.

Motivated by the above discussions, in this paper, we consider a more general problem which allows for general linear equality constraints to be imposed on , i.e.,

 minX∈Sn {⟨C,X⟩−μlogdet(X)+ρ∑i

where is a given linear map, is a given vector, are given parameters. Solving the problem (P) with a large is a challenging task due to the combination effects of the positive semidefinite variable and the complicated regularization term together with the linear constraints. At a first glance, it would appear to be extremely expensive to evaluate the second part of as it involves approximately terms, thus it becomes unthinkable to even solve (P) for the case when is large. Fortunately, as we shall see later, the symmetric nature of the summation allows us to carry out the evaluation of the regularization term in operations. This reduction in the computation cost makes it possible to solve the problem (P) for large .

Our contributions in this paper can be summarized in three parts. Firstly, we propose the model (P) to estimate the sparse Gaussian graphical model with hidden clustering structure, which also allows additional linear constraints to be imposed on the concentration matrix. As far as we are aware of, this is the first model that attempts to estimate the concentration matrix and uncover the hidden clustering structure in the variables simultaneously. Secondly, we design an efficient two-phase algorithm for solving the dual of (P). We develope a symmetric Gauss-Seidel based alternating direction method of the multipliers (sGS-ADMM) to generate an initial point to warm-start the second phase algorithm, which is a proximal augmented Lagrangian method (pALM), to get a solution with high accuracy. For solving the pALM subproblems, we use the semismooth Newton method where the sparsity and clustering structure is carefully analysed and exploited in the underlying generalized Jacobians to reduce the computational cost in each semismooth Newton iteration. Thirdly, we conduct comprehensive numerical experiments on both synthetic data and real data to demonstrate the performance of our model, as well as the efficiency and robustness of our proposed algorithm. The numerical results show that our model can rather successfully estimate the concentration matrix as well as uncovering its clustering structure.

The remaining parts of the paper are organized as follows. In Section 2, we state the problem setup and some related results in the literature. In Section 3, we describe the proposed two-phase algorithm for solving our model. In Section 4, we present the numerical results. Finally, in Section 5, we make some concluding remarks.

Throughout the paper, we use to denote a vector consisting of the diagonal entries of a matrix and to denote a diagonal matrix whose diagonal is given by a vector . For any matrix , denotes the Frobenius norm of .

## 2 Problem setup and preliminaries

In this section, we set up the problem and present some related properties of the regularization term and the function , respectively.

### 2.1 Duality and optimality conditions

The minimization form for the dual of (P) is given by

 miny∈Rm,Z∈Sn,S∈Sn −⟨b,y⟩−μlogdet(Z)+Q∗(−S)−nμ+nμlogμ (D) s.t. C−A∗y−Z−S=0,Z⪰0,

where is the adjoint map of , is the Fenchel conjugate function of that is defined by for any . The KKT system associated with (P) and (D) is given as

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩C−A∗y−Z−S=0,XZ=μIn, Z⪰0, X⪰0,0∈∂Q(X)+S,AX=b. (4)

Throughout this paper, we make the blanket assumption that is surjective and the solution set to the KKT system (4) is nonempty. Since the objective function of (P) is strictly convex with respect to , the optimal solution to (P) is unique, which we denote as .

### 2.2 The proximal mapping and Moreau envelope

For a given closed convex function , where is a finite dimensional real Euclidean space equipped with an inner product and its induced norm . The Moreau envelope of at is defined as

 Ef(x)=miny∈H {12∥y−x∥2+f(y)}.

The corresponding minimizer, which is called the proximal mapping of at , is denoted as . It is proved in [24, 26] that is globally Lipschitz continuous with modulus and is finite-valued, convex and continuously differentiable with

 ∇Ef(x)=x−Proxf(x).

The Moreau identity states that for any , it holds that

 Proxtf(x)+tProxf∗/t(x/t)=x.

### 2.3 Results related ro the regularization term Q(⋅)

Let be the linear map such that is the vector obtained from by concatenating the columns of the strictly upper triangular part of sequentially into a vector of dimension . The adjoint is such that is the operation of first putting the entries of the vector into the strictly upper triangular part of an matrix , and then symmetrizing it. Denote

 q(x)=ρ∥x∥1+λp(x),p(x)=∑1≤k

Then it is obvious that

 Q(X)=q(BX).

The function is the clustered lasso regularizer in the context of the linear regression models, which is studied in [3, 21, 25, 28]. The associated conjugate function, proximal mapping and the corresponding generalized Jacobian of the proximal mapping has been carefully studied in . By making use of , we have that for any ,

and

 ProxQ(Y) =argminX∈Sn {12∥diag(X)−diag(Y)∥2+∥BX−BY∥2+q(BX)} =Diag(diag(Y))+B∗Proxq(2BY).

Next we state the following proposition to compute for all .

###### Proposition 1.

For any , it holds that

where is the Clarke generalized Jacobian of at .

###### Proof.

The equality follows from [10, Example 2.5]. ∎

Here we present some results on the clustered lasso regularizer that are taken from . Denote , where . Let be the vector whose components are those of sorted in a non-increasing order, that is . Then we have the following proposition, which describes an efficient way for evaluating .

###### Proposition 2.

(a) For any , it can be proved that

 p(x)=⟨w,x↓⟩,

where is defined by , . Thus the computational cost of evaluating can be reduced from to .
(b) For any given , let be a permutation matrix such that is sorted in a non-increasing order. Then the proximal mapping of at can be computed as

 Proxλp(y)=PTyΠD(Pyy−λw),

where (the metric projection onto ) can be computed by the pool-adjacent-violators algorithm  in operations.
(c) For any , the proximal mapping of at can be computed as

 Proxq(y)=Proxρ∥⋅∥1(Proxλp(y))=sign(Proxλp(y))∘max(|Proxλp(y)|−ρ,0).

where the sign function, the absolute value and the maximum value are taken component-wise.

Next we consider the generalized Jacobian of , which denoted as . The detailed derivation of could be found in . Note that is strongly semismooth on with respect to . In the implementation of our proppsed algorithm, we need an explicitly computable element in for any given . As discussed in , we denote

 ID(y):={i∣BiΠD(y)=0, i=1,⋯,¯n−1},

where is the -th row of . Then we define two diagonal matrices with

 σi={1,if i∈ID(Pyy−λw),0,otherwise,for i=1,2,⋯,¯n−1,

and with

 θi={0,if |Proxλp(y)|i≤ρ,1,otherwise,for i=1,2,⋯,¯n.

Based on these notations, the following proposition provides a computable element in .

###### Proposition 3.

For any , we have that

 W=ΘPTy(I¯n−BT(ΣBBTΣ)†B)Py∈M(y),

where denotes the pseduoinverse. Further details on the computation of could be found in [21, Proposition 2.8].

### 2.4 Results related to the logdet(⋅) function

The following proposition states the computation of the proximal mapping of and the corresponding Jacobian, which is directly obtained from [30, Lemma 2.1]. For simplicity, we denote for any .

###### Proposition 4.

For any given

, with its eigenvalue decomposition

, where is the vector of eigenvalues and the columns of

are the corresponding orthonormal set of eigenvectors. We assume that

. Given and the scaler function for all , we define its matrix counterpart:

 ϕ+μ(X):=PDiag(ϕ+μ(d))PT,

where is such that its -th component is given by .
(a) The proximal mapping of can be computed as

 Proxμr(X)=ϕ+μ(X).

(b) is continuously differentiable and its Fréchet derivative at is given by

 (ϕ+μ)′(X)[H]=P(Ω∘(PTHP))PT∀H∈Sn,

where is defined by

 Ωij=ϕ+μ(di)+ϕ+μ(dj)√d2i+4μ+√d2j+4μ,i,j=1,⋯,n.

## 3 A two-phase algorithm

In this section, we propose a two-phase algorithm to solve the problem (P) based on the augmented Lagrangian function of (D). In Phase I, we design a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the problem to a moderate level of accuracy. In Phase II, we employ a proximal augmented Lagrangian method (pALM) with its subproblems solved by the semismooth Newton method (SSN) to get a solution with high accuracy. Note that the sGS-ADMM not only can be used to generate a good initial point to warm-start the pALM, it can also be used alone to solve the problem. But as a first-order method, the sGS-ADMM may not be efficient enough in some cases to solve a problem to high accuracy. Thus in the second phase, we switch to the superlinearly convergent pALM to compute an accurate solution.

A natural way to solve the problem (D) is the popular alternating direction method of the multipliers (ADMM), but as shown via a counterexample in , the directly extended sequential Gauss-Seidel-type multi-block ADMM may not be convergent even with a small step length. Thus, in this paper, we employ a more delicate symmetric Gauss-Seidel-type multi-block ADMM, i.e., the sGS-ADMM to solve (D). As is shown in , the sGS-ADMM is not only guaranteed to converge theoretically, in practice it also performs better than the possibly nonconvergent directly extended multi-block ADMM.

The Lagrangian function associated with (D) is given by

 l(y,Z,S;X) :=−⟨b,y⟩−μlogdet(Z)+Q∗(−S)+δSn+(Z)−nμ+nμlogμ −⟨C−A∗y−Z−S,X⟩,∀(y,Z,S,X)∈Rm×Sn×Sn×Sn. (5)

For , the associated augmented Lagrangian function is

 Lσ(y,Z,S;X):=l(y,Z,S;X)+σ2∥C−A∗y−Z−S∥2. (6)

Based on the augmented Lagrangian function (6), we design the sGS-ADMM for solving (D). To be specific, we update and alternatively as in the commonly used -block ADMM, but with the key difference of applying the sGS iteration technique  to the second block. The template for the algorithm is given as follows:

 ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩Zk+1=argminLσ(yk,Z,Sk;Xk),¯¯¯yk+1=argminLσ(y,Zk+1,Sk;Xk),Sk+1=argminLσ(¯yk+1,Zk+1,S;Xk),yk+1=argminLσ(y,Zk+1,Sk+1;Xk),Xk+1=Xk−τσ(C−A∗yk+1−Zk+1−Sk+1),

where is a given step length that is typically set to be . The implementation of updating each variable can be given as follows.

##### Updating of Z.

Given , can be obtained by

where and .

##### Updating of y.

Given , can be obtained by solving the linear system as

 ¯¯¯y =argminy∈Rn {−⟨b,y⟩+σ2∥C−A∗y−ˆZ−ˆS−1σˆX∥2}=(AA∗)−1(A(C−ˆS−ˆZ−1σˆX)+1σb).
##### Updating of S.

Given , the updating of could be given as

 ¯¯¯¯S =argminS∈Sn {Q∗(−S)+σ2∥S+ˆV∥2}=−ProxQ∗(ˆV)=−ˆV+ProxQ(ˆV),

where .

The whole sGS-ADMM for solving (D) can be summarized as below.

Input: , , , , , and .

1:  Compute
 Zk+1=(ϕ+γ(Mk)−Mk)/σ,Mk=Xk−σ(C−A∗yk−Sk).
2:  Compute
 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩¯¯¯yk+1=(AA∗)−1(A(C−Sk−Zk+1−Xk/σ)+b/σ),Sk+1=−Vk+ProxQ(Vk),Vk=−(C−A∗¯¯¯yk+1−Zk+1−Xk/σ),yk+1=(AA∗)−1(A(C−Sk+1−Zk+1−Xk/σ)+b/σ).
3:  Compute
 Xk+1=Xk−τσ(C−A∗yk+1−Sk+1−Zk+1).
4:  , go to Step 1.

The convergence result of the above algorithm can be obtained from [5, Theorem 5.1] without much difficulty.

###### Theorem 1.

Let be the sequence generated by the sGS-ADMM. Then the sequence converges to an optimal solution of (D) and converges to the optimal solution of (P).

### 3.2 Phase II: pALM

The augmented Lagrangian method (ALM) is a widely used method for solving the convex optimization problem in the literature. It has the important property of possessing superlinear convergence guarantee.

We write the dual problem (D) in the following unconstrained form

 −⟨b,y⟩−μlogdet(C−A∗y−S)+Q∗(−S) +δSn+(C−A∗y−S)−nμ+nμlogμ}. (D′)

Denote

 ˜f(y,S,Z,V) =−⟨b,y⟩−μlogdet(C−A∗y−S−Z) +Q∗(−S+V)+δSn+(C−A∗y−S−Z)−nμ+nμlogμ.

Then by [27, Example 11.46], the Lagrangian function associated with (3.2) is

 ˜l(y,S;X,U) =infZ∈Sn,V∈Sn{˜f(y,S,Z,V)−⟨Z,X⟩−⟨U,V⟩} =−⟨b,y⟩−⟨C−A∗y−S,X⟩+μlogdetX−δSn+(X)−⟨U,S⟩−Q(U).

By [27, Example 11.57], the corresponding augmented Lagrangian function is

 ˜Lσ(y,S;X,U)=sup˜X∈Sn,˜U∈Sn{˜l(y,S;˜X,˜U)−12σ∥X−˜X∥2−12σ∥U−˜U∥2} =−⟨b,y⟩−1σEμσr(M(y,S))+12σ∥M(y,S)∥2−12σ∥X∥2−1σEσQ(U−σS)+12σ∥U−σS∥2−12σ∥U∥2,

where .

Based on the above notations, we describe the proximal augmented Lagrangian method (pALM) for solving (3.2) as follows.

Algorithm 2 : pALM

Input: , , , , , .

1:  Compute
 (7)
2:  Compute
 Xk+1 =Proxμσkr(Xk−σ(C−A∗yk+1−Sk+1)), Uk+1 =ProxσkQ(Uk−σSk+1).
3:  Update , , go to Step 1.

#### 3.2.1 Convergence result of the pALM

The global convergence and global linear-rate convergence of the pALM can be obtained following the idea in . To establish the convergence result, we define the maximal monotone operator

 T˜l(y,S,X,U):={(y′,S′,X′,U′)∣(y′,S′,−X′,−U′)∈∂˜l(y,S;X,U)},

and its inverse operator

 T−1˜l(y′,S′,X′,U′):=argminy,SmaxX,U {˜l(y,S;X,U)−⟨y′,y⟩−⟨S′,S⟩+⟨X′,X⟩+⟨U′,U⟩}.

As we note in the pALM, we need to specify the stopping criterion of computing the approximate solution in (7). Denote the operator

 Λ=Diag(τIm,τIn,In,In),

where is the identity operator over . We use the following stopping criteria for solving (7):

 ∥∇Ψk(yk+1,Sk+1)∥ ≤min{√τ,1}σkεk, (A) ∥∇Ψk(yk+1,Sk+1)∥ ≤min{√τ,1}σkδk∥(yk+1,Sk+1,Xk+1,Uk+1)−(yk,Sk,Xk,Uk)∥Λ, (B)

where and are summable nonnegative sequences satisfying for all .

Based on the above preparation, we could present the convergence result of the pALM in the following theorem, which is a direct application of [20, Theorem 1 and Theorem 2]

###### Theorem 2.

(a) Let be the sequence generated by the pALM with the stopping criterion (A). Then is bounded, converges to an optimal solution of (3.2), and both and converge to the optimal solution of (P).
(b) Let be a positive number such that . Asuume that there exists such that

 distΛ((y,S,X,U),T−1ˆl(0))≤κdist(0,Tˆl(y,S,X,U)),

for all satisfying . Suppose that the initial point satisfies

 distΛ((y0,S0,X0,U0),T−1ˆl(0))≤ρ−∞∑k=0εk.

Let be the sequence generated by the pALM with the stopping criteria (A) and (B). Then for , it holds that

 distΛ((yk+1,Sk+1,Xk+1,Uk+1),T−1ˆl(0))≤μkdistΛ((yk,Sk,Xk,Uk),T−1ˆl(0)),

where

 μk=δk+(1+δk)κmax{τ,1}/√σ2k+κ2max{τ2,1}1−δk→μ∞:=κmax{τ,1}√σ2∞+κ2max{τ2,1}.

#### 3.2.2 A semismooth Newton method for solving the pALM subproblems

As one can see, the main task in the pALM is to solve the subproblem (7) in an efficient way. Note that given , the subproblem (7) has the form of

Since is a strongly convex function on , the above minimization problem has a unique optimal solution, denoted as , which can be computed by solving the nonsmooth optimality condition:

 ∇Ψ(y,S)=(−b+AProxμσr(˜M(y,S))+τσ(y−˜y)Proxμσr(˜M(y,S))−ProxσQ(˜U−σS)+τσ(S−˜S))=0, (8)

where .

Define the operator as

 ^∂2Ψ(y,S)(dydS)=σ(AIn)(ϕ+μσ(˜M(y,S)))′(A∗dy+dS)+σ(0∂ProxσQ(˜U−σS)[dS])+τσ(dydS),

for any , . We can treat as the generalized Jacobian of at . By the analysis of the regularization term and the function in Section 2, is strongly semismooth with respect to . Thus we could apply the semismooth Newton method (SSN) to solve (8), which has the following template.

Algorithm 3 : SSN

Input: , choose , , and set .

1:  Choose , use the conjugate gradient method (CG) to solve the linear system
 σ(AIn)(ϕ+γ(˜M(yj,Sj)))′(A∗djy+djS)+σ(0HjdjS)+τσ(djydjS)=−∇Ψ(yj,Sj)
to obtain and such that the residual is no larger than .
2:  Set , where is the first nonnegative integer for which
 Ψ(yj+δmdjy,Sj+δmdjS)≤Ψ(yj,Sj)+ζδm⟨∇Ψ(yj,Sj),(djydjS)⟩.
3:  Set , , , go to Step 1.

Since the operator is positive definite, we can easily obtain the following superlinear convergence result of the SSN method from .

###### Theorem 3.

Let be the sequence generated by the SSN method, then converges to and

 ∥(yj+1,Sj+1)−(¯¯¯y,¯¯¯¯S)∥=O(∥(y