## 1 Introduction

Let $X = (X_1, \dots, X_n)^\top$ be a random vector following a multivariate Gaussian distribution $X \sim \mathcal{N}(\mu, \Sigma)$ with an unknown nonsingular covariance matrix $\Sigma$. Gaussian graphical models [18] estimate the concentration matrix $\Theta = \Sigma^{-1}$ from a sample covariance matrix $S$ of $X$. It is known that $\Theta_{ij} = 0$ if and only if $X_i$ and $X_j$ are conditionally independent, given all the other variables. The Gaussian graphical model can be represented by an undirected graph $G = (V, E)$, where the vertex set $V$ contains the $n$ coordinates and the edge set $E$ describes the conditional independence relationships among $X_1, \dots, X_n$: there is no edge between $X_i$ and $X_j$ if and only if $\Theta_{ij} = 0$. To detect the nonzero elements in the concentration matrix $\Theta$, researchers have proposed sparse Gaussian graphical models [33, 1]. Given a sample covariance matrix $S$, the sparse Gaussian graphical model estimates the concentration matrix by solving the following $\ell_1$-regularized log-likelihood minimization problem:

$$\min_{\Theta \succeq 0} \ \langle S, \Theta \rangle - \log\det \Theta + \lambda \|\Theta\|_1, \tag{1}$$

where $\lambda > 0$ is a given positive parameter, $\langle S, \Theta \rangle = \mathrm{Tr}(S\Theta)$ is the standard trace inner product between $S$ and $\Theta$, and $\Theta \succeq 0$ means that $\Theta$ is positive semidefinite. We adopt the convention that $\log\det \Theta = -\infty$ if $\Theta$ is not positive definite. The $\ell_1$-norm penalty, which is motivated by the lasso idea [29], enforces element-wise sparsity on $\Theta$. There are many methods for solving the sparse Gaussian graphical model, such as the well-known GLasso algorithm [9], the Newton-CG primal proximal point algorithm [30], and QUIC [14].
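The equivalence between zeros in the concentration matrix and conditional independence can be checked numerically. The sketch below (pure NumPy, with an illustrative tridiagonal precision matrix of our own choosing) builds a 4-variable Gaussian "chain", whose concentration matrix is tridiagonal, and verifies that although the covariance matrix is fully dense, inverting it recovers the sparse zero pattern.

```python
import numpy as np

# Illustrative example: a 4-variable Gaussian "chain" X1 - X2 - X3 - X4.
# Its concentration (precision) matrix is tridiagonal: non-adjacent
# variables are conditionally independent given the rest.
Theta = np.array([[ 2.0, -1.0,  0.0,  0.0],
                  [-1.0,  2.0, -1.0,  0.0],
                  [ 0.0, -1.0,  2.0, -1.0],
                  [ 0.0,  0.0, -1.0,  2.0]])

Sigma = np.linalg.inv(Theta)        # covariance: generally dense
Theta_back = np.linalg.inv(Sigma)   # inverting recovers the sparse pattern

dense_cov = np.all(np.abs(Sigma) > 1e-12)   # Sigma has no zero entries
recovered = np.allclose(Theta_back, Theta)  # zero pattern of Theta recovered
print(dense_cov, recovered)  # True True
```

This is exactly why sparsity is imposed on the concentration matrix rather than on the covariance matrix itself.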

The concentration matrix may have additional structures other than sparsity. For example, Honorio et al. [13] enforce local constancy to find connectivities between two close or distant clusters of variables; Højsgaard and Lauritzen [11, 12] propose restricted concentration models where parameters associated with edges or vertices of the same class are restricted to be identical; Duchi et al. [7] penalize certain groups of edges together. In all these models, the clusters of the coordinates are assumed to be known. However, in many real applications, such as gene expression in cancer data [15, 32], the group/cluster information may be unknown in advance. The authors of [23] propose a two-stage method for learning sparse Gaussian graphical models with unknown block structure: a variational Bayes algorithm learns the block structure in the first stage, and the concentration matrix is then estimated by the block method in the second stage.

Here we aim to estimate the sparse concentration matrix and uncover the hidden clustering structure of the coordinates simultaneously. Note that in the context of a linear regression model where the regression coefficients are expected to be clustered into groups, the clustered lasso regularizer

$$q(\beta) = \lambda_1 \|\beta\|_1 + \lambda_2 \sum_{1 \le i < j \le d} |\beta_i - \beta_j|, \quad \beta \in \mathbb{R}^d,$$

[3, 21, 25, 28] has been widely used. We borrow this regularization term to discover the sparsity and the unknown clustering structure in Gaussian graphical models. Thus we modify the sparse Gaussian graphical model (1) as follows:

$$\min_{\Theta \succeq 0} \ \langle S, \Theta \rangle - \log\det \Theta + \lambda_1 \sum_{i<j} |\Theta_{ij}| + \lambda_2 \sum_{(i,j) \prec (k,l)} |\Theta_{ij} - \Theta_{kl}|, \tag{2}$$

where the last sum ranges over all distinct pairs of strictly upper triangular index pairs $(i,j)$ and $(k,l)$,

and $\lambda_1, \lambda_2 > 0$ are given parameters. In the above model, the penalty on the pairwise differences forces those entries of the concentration matrix associated with the same cluster of underlying random variables to be identical. In that way, the clustering structure of the random variables can be discovered. In some more complicated cases, the conditional independence pattern may be partially known. To deal with those cases, one can impose additional constraints on $\Theta$ to get the following model:

$$\min_{\Theta \succeq 0} \ \langle S, \Theta \rangle - \log\det \Theta + \lambda_1 \sum_{i<j} |\Theta_{ij}| + \lambda_2 \sum_{(i,j) \prec (k,l)} |\Theta_{ij} - \Theta_{kl}| \quad \text{s.t.} \ \Theta_{ij} = 0, \ (i,j) \in \Omega, \tag{3}$$

where $\Omega$ is the set of pairs of nodes such that $X_i$ and $X_j$ are known to be conditionally independent.

Motivated by the above discussions, in this paper we consider a more general problem which allows general linear equality constraints to be imposed on $\Theta$, i.e.,

$$\min_{\Theta \succeq 0} \ \langle S, \Theta \rangle - \log\det \Theta + \lambda_1 \sum_{i<j} |\Theta_{ij}| + \lambda_2 \sum_{(i,j) \prec (k,l)} |\Theta_{ij} - \Theta_{kl}| \quad \text{s.t.} \ \mathcal{A}(\Theta) = b, \tag{P}$$

where $\mathcal{A}$ is a given linear map, $b$ is a given vector, and $\lambda_1, \lambda_2 > 0$ are given parameters. Solving the problem (P) with a large $n$ is a challenging task due to the combined effects of the positive semidefinite variable and the complicated regularization term together with the linear constraints. At first glance, it would appear extremely expensive to evaluate the second part of the regularizer, as it involves on the order of $n^4$ terms; thus it would seem unthinkable to even solve (P) for the case when $n$ is large. Fortunately, as we shall see later, the symmetric nature of the summation allows us to carry out the evaluation of the regularization term in $O(n^2 \log n)$ operations. This reduction in the computational cost makes it possible to solve the problem (P) for large $n$.
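To make the scale concrete, here is a quick count (our own illustrative numbers, using $\bar n = n(n-1)/2$ penalized entries): the naive pairwise sum has $\bar n(\bar n - 1)/2$ terms, while the sorting-based evaluation described later costs roughly $\bar n \log_2 \bar n$ operations.

```python
import math

# Term counts for the pairwise-difference penalty on the n(n-1)/2
# strictly upper triangular entries of an n x n concentration matrix.
n = 100
m = n * (n - 1) // 2                 # number of penalized entries (4950)
naive_terms = m * (m - 1) // 2       # pairs |x_k - x_l| with k < l
sort_cost = int(m * math.log2(m))    # rough cost of the sorting trick
print(m, naive_terms, sort_cost)     # ~12.2 million terms vs ~60 thousand ops
```

Already at $n = 100$ the naive evaluation touches over twelve million terms per objective evaluation, which is what makes the $O(n^2 \log n)$ reduction essential.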

Our contributions in this paper can be summarized in three parts. Firstly, we propose the model (P) to estimate the sparse Gaussian graphical model with hidden clustering structure, which also allows additional linear constraints to be imposed on the concentration matrix. As far as we are aware, this is the first model that attempts to estimate the concentration matrix and uncover the hidden clustering structure of the variables simultaneously. Secondly, we design an efficient two-phase algorithm for solving the dual of (P). We develop a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to generate an initial point to warm-start the second-phase algorithm, a proximal augmented Lagrangian method (pALM), which computes a solution of high accuracy. For solving the pALM subproblems, we use the semismooth Newton method, where the sparsity and clustering structure are carefully analysed and exploited in the underlying generalized Jacobians to reduce the computational cost of each semismooth Newton iteration. Thirdly, we conduct comprehensive numerical experiments on both synthetic and real data to demonstrate the performance of our model, as well as the efficiency and robustness of our proposed algorithm. The numerical results show that our model can rather successfully estimate the concentration matrix as well as uncover its clustering structure.

The remaining parts of the paper are organized as follows. In Section 2, we state the problem setup and some related results in the literature. In Section 3, we describe the proposed two-phase algorithm for solving our model. In Section 4, we present the numerical results. Finally, in Section 5, we make some concluding remarks.

Throughout the paper, we use $\mathrm{diag}(X)$ to denote the vector consisting of the diagonal entries of a matrix $X$ and $\mathrm{Diag}(x)$ to denote the diagonal matrix whose diagonal is given by a vector $x$. For any matrix $X$, $\|X\|_F$ denotes the Frobenius norm of $X$.

## 2 Problem setup and preliminaries

In this section, we set up the problem and present some related properties of the regularization term and of the log-determinant function.

### 2.1 Duality and optimality conditions

### 2.2 The proximal mapping and Moreau envelope

Let $f : \mathcal{X} \to (-\infty, +\infty]$ be a given closed proper convex function, where $\mathcal{X}$ is a finite dimensional real Euclidean space equipped with an inner product $\langle \cdot, \cdot \rangle$ and its induced norm $\| \cdot \|$. The Moreau envelope of $f$ at $x \in \mathcal{X}$ with parameter $t > 0$ is defined as

$$E_{tf}(x) = \min_{y \in \mathcal{X}} \Big\{ f(y) + \frac{1}{2t} \| y - x \|^2 \Big\}.$$

The corresponding unique minimizer, which is called the proximal mapping of $f$ at $x$, is denoted as $\mathrm{Prox}_{tf}(x)$. It is proved in [24, 26] that $\mathrm{Prox}_{tf}(\cdot)$ is globally Lipschitz continuous with modulus $1$ and that $E_{tf}(\cdot)$ is finite-valued, convex and continuously differentiable with

$$\nabla E_{tf}(x) = \frac{1}{t} \big( x - \mathrm{Prox}_{tf}(x) \big).$$

The Moreau identity states that for any $x \in \mathcal{X}$ and $t > 0$, it holds that

$$\mathrm{Prox}_{tf}(x) + t \, \mathrm{Prox}_{f^*/t}(x/t) = x,$$

where $f^*$ denotes the Fenchel conjugate of $f$.
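As a concrete illustration of the Moreau identity, take $f = \|\cdot\|_1$: its proximal mapping is componentwise soft-thresholding, its conjugate $f^*$ is the indicator of the $\ell_\infty$ unit ball, and $\mathrm{Prox}_{f^*/t}$ is the projection onto that ball. The sketch below is a numerical sanity check of our own, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
t = 0.7

# Prox of t*||.||_1: componentwise soft-thresholding at level t.
prox_tf = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# f* is the indicator of the l_inf unit ball, so Prox_{f*/t}(x/t) is the
# projection of x/t onto {z : |z_i| <= 1}.
prox_conj = np.clip(x / t, -1.0, 1.0)

# Moreau identity: Prox_{tf}(x) + t * Prox_{f*/t}(x/t) = x.
print(np.allclose(prox_tf + t * prox_conj, x))  # True
```

The identity lets one switch freely between the proximal mapping of a function and that of its conjugate, which is used repeatedly in the algorithms below.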

### 2.3 Results related to the regularization term

Let $\mathcal{B} : \mathcal{S}^n \to \mathbb{R}^{\bar n}$, with $\bar n := n(n-1)/2$, be the linear map such that $\mathcal{B}(X)$ is the vector obtained from $X$ by concatenating the columns of the strictly upper triangular part of $X$ sequentially into a vector of dimension $\bar n$. The adjoint $\mathcal{B}^*$ is such that $\mathcal{B}^* y$ is the result of first putting the entries of the vector $y \in \mathbb{R}^{\bar n}$ into the strictly upper triangular part of an $n \times n$ matrix, and then symmetrizing it. Denote

$$p(x) := \lambda_1 \| x \|_1 + \lambda_2 \sum_{1 \le i < j \le \bar n} | x_i - x_j |, \quad x \in \mathbb{R}^{\bar n}.$$

Then it is obvious that the regularization term in (P) can be written as $p(\mathcal{B}(\Theta))$.

The function $p$ is the clustered lasso regularizer in the context of linear regression models, which is studied in [3, 21, 25, 28]. The associated conjugate function, proximal mapping and the corresponding generalized Jacobian of the proximal mapping have been carefully studied in [21]. By making use of these results, we have that for any ,

and

Next we state the following proposition to compute for all .

###### Proposition 1.

For any , it holds that

where is the Clarke generalized Jacobian of at .

###### Proof.

The equality follows from [10, Example 2.5]. ∎

Here we present some results on the clustered lasso regularizer that are taken from [21]. Denote $g(x) := \sum_{1 \le i < j \le \bar n} |x_i - x_j|$, so that $p(x) = \lambda_1 \|x\|_1 + \lambda_2 g(x)$. Let $x^\downarrow$ be the vector whose components are those of $x$ sorted in a non-increasing order, that is, $x^\downarrow_1 \ge x^\downarrow_2 \ge \cdots \ge x^\downarrow_{\bar n}$. Then we have the following proposition, which describes an efficient way for evaluating $g$.

###### Proposition 2.

(a) For any $x \in \mathbb{R}^{\bar n}$, it can be proved that

$$\sum_{1 \le i < j \le \bar n} | x_i - x_j | = \langle w, x^\downarrow \rangle,$$

where $w \in \mathbb{R}^{\bar n}$ is defined by $w_i = \bar n + 1 - 2i$, $i = 1, \dots, \bar n$. Thus the computational cost of evaluating the pairwise-difference sum can be reduced from $O(\bar n^2)$ to $O(\bar n \log \bar n)$.

(b) For any given $x \in \mathbb{R}^{\bar n}$, let $P_x$ be a permutation matrix such that $P_x x$ is sorted in a non-increasing order. Then the proximal mapping of $\lambda_2 g$ at $x$ can be computed as

$$\mathrm{Prox}_{\lambda_2 g}(x) = P_x^\top \, \Pi_D \big( P_x x - \lambda_2 w \big),$$

where $\Pi_D$ (the metric projection onto the monotone cone $D = \{ z \in \mathbb{R}^{\bar n} : z_1 \ge z_2 \ge \cdots \ge z_{\bar n} \}$) can be computed by the pool-adjacent-violators algorithm [2] in $O(\bar n)$ operations.

(c) For any $x \in \mathbb{R}^{\bar n}$, the proximal mapping of $p$ at $x$ can be computed as

$$\mathrm{Prox}_{p}(x) = \mathrm{sign}(u) \odot \max \{ |u| - \lambda_1, 0 \}, \quad u := \mathrm{Prox}_{\lambda_2 g}(x),$$

where the sign function, the absolute value and the maximum value are taken component-wise.
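Proposition 2 can be turned into a short implementation. The sketch below is our own illustration of parts (a)-(c) as reconstructed above (sorting trick, a stack-based pool-adjacent-violators projection, and soft-thresholding after the pairwise prox); it is a sketch under those formulas, not the paper's implementation, and it checks itself by comparing against naive evaluation and random perturbations of the objective.

```python
import numpy as np

def pairwise_abs_sum(x):
    """O(m log m) evaluation of sum_{i<j}|x_i - x_j| via sorting:
    the k-th largest entry carries the net coefficient m + 1 - 2k."""
    m = len(x)
    xs = np.sort(x)[::-1]                     # non-increasing order
    w = m + 1 - 2 * np.arange(1, m + 1)
    return float(w @ xs)

def project_nonincreasing(y):
    """Metric projection onto {z : z_1 >= ... >= z_m} by the
    pool-adjacent-violators algorithm (PAVA)."""
    blocks = []                               # stack of (mean, length)
    for v in y:
        mean, length = float(v), 1
        while blocks and blocks[-1][0] < mean:  # pool adjacent violators
            pm, pl = blocks.pop()
            mean = (mean * length + pm * pl) / (length + pl)
            length += pl
        blocks.append((mean, length))
    return np.concatenate([np.full(l, m) for m, l in blocks])

def prox_pairwise(x, lam2):
    """Prox of lam2 * sum_{i<j}|x_i - x_j|, following part (b):
    sort, shift by lam2 * w, project onto the monotone cone, unsort."""
    m = len(x)
    order = np.argsort(-x, kind="stable")     # permutation sorting x desc.
    w = m + 1 - 2 * np.arange(1, m + 1)
    z = project_nonincreasing(x[order] - lam2 * w)
    out = np.empty(m)
    out[order] = z
    return out

def prox_clustered_lasso(x, lam1, lam2):
    """Prox of the clustered lasso, following part (c): soft-threshold
    the result of the pairwise prox."""
    u = prox_pairwise(x, lam2)
    return np.sign(u) * np.maximum(np.abs(u) - lam1, 0.0)

# Sanity checks on a random instance.
rng = np.random.default_rng(0)
x, lam1, lam2 = rng.standard_normal(6), 0.3, 0.2
naive = sum(abs(x[i] - x[j]) for i in range(6) for j in range(i + 1, 6))
print(np.isclose(pairwise_abs_sum(x), naive))  # True

def obj(y):
    return (0.5 * np.sum((y - x) ** 2) + lam1 * np.sum(np.abs(y))
            + lam2 * sum(abs(y[i] - y[j])
                         for i in range(6) for j in range(i + 1, 6)))

y = prox_clustered_lasso(x, lam1, lam2)
better = all(obj(y) <= obj(y + 0.1 * rng.standard_normal(6)) + 1e-12
             for _ in range(200))
print(better)  # True: no random perturbation improves the prox objective
```

The perturbation check is only a probabilistic sanity test; the exactness of the formulas is established in [21].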

Next we consider the generalized Jacobian of $\mathrm{Prox}_p$, whose detailed derivation can be found in [21]. Note that $\mathrm{Prox}_p$ is strongly semismooth with respect to this generalized Jacobian. In the implementation of our proposed algorithm, we need an explicitly computable element of it at any given point. As discussed in [21], we denote

where is the -th row of . Then we define two diagonal matrices with

and with

Based on these notations, the following proposition provides a computable element in .

###### Proposition 3.

For any , we have that

where $\dagger$ denotes the pseudoinverse. Further details on the computation can be found in [21, Proposition 2.8].

### 2.4 Results related to the log-determinant function

The following proposition states the computation of the proximal mapping of the log-determinant term and the corresponding Jacobian, which is directly obtained from [30, Lemma 2.1]. For simplicity, we denote $h(X) := -\log\det X$ for any $X \succ 0$.

###### Proposition 4.

For any given $X \in \mathcal{S}^n$ with eigenvalue decomposition $X = Q \, \mathrm{Diag}(d) \, Q^\top$, where $d \in \mathbb{R}^n$ is the vector of eigenvalues and the columns of $Q$ are the corresponding orthonormal set of eigenvectors, we assume that $d_1 \ge d_2 \ge \cdots \ge d_n$. Given $t > 0$ and the scalar function $\phi_t(s) := \big( s + \sqrt{s^2 + 4t} \big)/2$ for all $s \in \mathbb{R}$, we define its matrix counterpart

$$\Phi_t(X) := Q \, \mathrm{Diag}(\phi_t(d)) \, Q^\top,$$

where $\phi_t(d)$ is such that its $i$-th component is given by $\phi_t(d_i)$.

(a) The proximal mapping of $t(-\log\det(\cdot))$ can be computed as

$$\mathrm{Prox}_{t(-\log\det)}(X) = \Phi_t(X).$$

(b) $\Phi_t(\cdot)$ is continuously differentiable and its Fréchet derivative at $X$ is given by

$$\Phi_t'(X)[H] = Q \big( \Omega \circ (Q^\top H Q) \big) Q^\top, \quad H \in \mathcal{S}^n,$$

where $\Omega \in \mathcal{S}^n$ is defined by

$$\Omega_{ij} = \begin{cases} \dfrac{\phi_t(d_i) - \phi_t(d_j)}{d_i - d_j}, & d_i \ne d_j, \\[4pt] \phi_t'(d_i), & d_i = d_j. \end{cases}$$
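The eigenvalue recipe in part (a) can be verified against its optimality condition: $Y = \mathrm{Prox}_{t(-\log\det)}(X)$ solves $Y - X - tY^{-1} = 0$, which reduces eigenvalue-wise to $y - d - t/y = 0$, i.e. $y = (d + \sqrt{d^2 + 4t})/2$. The sketch below implements the formula as we have stated it (consistent with our reading of [30, Lemma 2.1]) and checks the stationarity condition numerically.

```python
import numpy as np

def prox_neg_logdet(X, t):
    """Prox of t * (-log det) at a symmetric X, via eigen-decomposition:
    each eigenvalue d is mapped to (d + sqrt(d^2 + 4t)) / 2."""
    d, Q = np.linalg.eigh(X)
    phi = (d + np.sqrt(d ** 2 + 4.0 * t)) / 2.0
    return (Q * phi) @ Q.T          # Q * phi scales the columns of Q

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
X = (A + A.T) / 2.0                 # a symmetric (possibly indefinite) input
t = 0.8
Y = prox_neg_logdet(X, t)

# Y must be positive definite and satisfy the stationarity condition
# Y - X - t * Y^{-1} = 0 of  min_Y 0.5*||Y - X||_F^2 - t*log det(Y).
pd = np.all(np.linalg.eigvalsh(Y) > 0)
stat = np.allclose(Y - X - t * np.linalg.inv(Y), 0.0, atol=1e-8)
print(pd, stat)  # True True
```

Note that the prox is well defined and positive definite even when the input $X$ is indefinite, since $\phi_t(s) > 0$ for every real $s$.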

## 3 A two-phase algorithm

In this section, we propose a two-phase algorithm to solve the problem (P) based on the augmented Lagrangian function of (D). In Phase I, we design a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the problem to a moderate level of accuracy. In Phase II, we employ a proximal augmented Lagrangian method (pALM), with its subproblems solved by the semismooth Newton method (SSN), to get a solution of high accuracy. Note that the sGS-ADMM can not only be used to generate a good initial point to warm-start the pALM, but can also be used alone to solve the problem. However, as a first-order method, the sGS-ADMM may not be efficient enough in some cases to solve a problem to high accuracy. Thus in the second phase, we switch to the superlinearly convergent pALM to compute an accurate solution.

### 3.1 Phase I: sGS-ADMM

A natural way to solve the problem (D) is the popular alternating direction method of multipliers (ADMM). However, as shown via a counterexample in [4], the directly extended sequential Gauss-Seidel-type multi-block ADMM may not converge even with a small step length. Thus, in this paper, we employ a more delicate symmetric Gauss-Seidel-type multi-block ADMM, i.e., the sGS-ADMM, to solve (D). As shown in [5], the sGS-ADMM is not only guaranteed to converge theoretically, but in practice it also performs better than the possibly nonconvergent directly extended multi-block ADMM.

The Lagrangian function associated with (D) is given by

(5) |

For , the associated augmented Lagrangian function is

(6) |

Based on the augmented Lagrangian function (6), we design the sGS-ADMM for solving (D). To be specific, we update the two groups of variables alternately as in the commonly used 2-block ADMM, but with the key difference of applying the sGS iteration technique [19] to the second block. The template for the algorithm is given as follows:

where is a given step length that is typically set to be . The implementation of updating each variable can be given as follows.

##### Updating of .

Given , can be obtained by

where and .

##### Updating of .

Given , can be obtained by solving the linear system as

##### Updating of .

Given , the updating of can be given as

where .

The whole sGS-ADMM for solving (D) can be summarized as below.

Algorithm 1 : sGS-ADMM

Input: , , , , , and .

The convergence result of the above algorithm can be obtained from [5, Theorem 5.1] without much difficulty.

### 3.2 Phase II: pALM

The augmented Lagrangian method (ALM) is a widely used method for solving convex optimization problems. It has the important property of possessing a superlinear convergence guarantee.

We write the dual problem (D) in the following unconstrained form

(D) |

Denote

Then by [27, Example 11.46], the Lagrangian function associated with (3.2) is

By [27, Example 11.57], the corresponding augmented Lagrangian function is

where .

Based on the above notations, we describe the proximal augmented Lagrangian method (pALM) for solving (3.2) as follows.

Algorithm 2 : pALM

Input: , , , , , .

(7) |

#### 3.2.1 Convergence result of the pALM

The global convergence and global linear-rate convergence of the pALM can be obtained following the idea in [20]. To establish the convergence result, we define the maximal monotone operator

and its inverse operator

As we note in the pALM, we need to specify the stopping criterion of computing the approximate solution in (7). Denote the operator

where is the identity operator over . We use the following stopping criteria for solving (7):

(A) | ||||

(B) |

where and are summable nonnegative sequences satisfying for all .

Based on the above preparation, we can present the convergence result of the pALM in the following theorem, which is a direct application of [20, Theorem 1 and Theorem 2].

###### Theorem 2.

(a) Let be the sequence generated by the pALM with the stopping criterion (A). Then is bounded, converges to an optimal solution of (3.2), and both and converge to the optimal solution of (P).

(b) Let be a positive number such that . Assume that there exists such that

for all satisfying . Suppose that the initial point satisfies

Let be the sequence generated by the pALM with the stopping criteria (A) and (B). Then for , it holds that

where

#### 3.2.2 A semismooth Newton method for solving the pALM subproblems

As one can see, the main task in the pALM is to solve the subproblem (7) efficiently. Note that given , the subproblem (7) has the form of

Since is a strongly convex function on , the above minimization problem has a unique optimal solution, denoted as , which can be computed by solving the nonsmooth optimality condition:

(8) |

where .

Define the operator as

for any , . We can treat as the generalized Jacobian of at . By the analysis of the regularization term and the log-determinant function in Section 2, is strongly semismooth with respect to . Thus we can apply the semismooth Newton method (SSN) to solve (8), which has the following template.

Algorithm 3 : SSN

Input: , choose , , and set .

Since the operator is positive definite, we can readily obtain the following superlinear convergence result for the SSN method from [34].

###### Theorem 3.

Let be the sequence generated by the SSN method. Then converges to and
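To make the SSN template concrete, here is a toy semismooth Newton iteration of our own devising (not the algorithm for (8), which involves the operators above): it solves the nonsmooth fixed-point equation $F(x) = x - \max(0, x - (Qx - c)) = 0$ characterizing a small nonnegatively constrained quadratic program, using a diagonal 0/1 matrix drawn from the Clarke generalized Jacobian of $\max(0, \cdot)$.

```python
import numpy as np

# Toy problem: min 0.5*x^T Q x - c^T x  s.t. x >= 0, written as the
# semismooth equation F(x) = x - max(0, x - (Qx - c)) = 0.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -2.0])

def F(x):
    return x - np.maximum(0.0, x - (Q @ x - c))

x = np.zeros(2)
for _ in range(50):
    if np.linalg.norm(F(x)) < 1e-10:
        break
    # One element of the Clarke generalized Jacobian of F at x:
    # J = I - D (I - Q), with D = Diag(1{x - (Qx - c) > 0}).
    d = (x - (Q @ x - c) > 0).astype(float)
    J = np.eye(2) - d[:, None] * (np.eye(2) - Q)
    x = x - np.linalg.solve(J, F(x))  # semismooth Newton step

print(np.round(x, 6), np.linalg.norm(F(x)) < 1e-10)
```

On this instance the iteration lands on the solution $x^* = (1/3, 0)$ in very few steps, mirroring the fast local convergence stated in Theorem 3 for the full algorithm.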
