 # Learning Gaussian Graphical Models Using Discriminated Hub Graphical Lasso

We develop a new method called Discriminated Hub Graphical Lasso (DHGL) based on Hub Graphical Lasso (HGL) by providing prior information of hubs. We apply this new method in two situations: with known hubs and without known hubs. Then we compare DHGL with HGL using several measures of performance. When some hubs are known, we can always estimate the precision matrix better via DHGL than HGL. When no hubs are known, we use Graphical Lasso (GL) to provide information of hubs and find that the performance of DHGL will always be better than HGL if correct prior information is given and will seldom degenerate when the prior information is wrong.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Graphical models have been used widely in various systems. A graph consists of nodes, representing variables, and edges between nodes, representing the conditional dependency between two variables . In order to get a sparse and interpretable estimate of the graph, many authors have considered the optimization problem in the form , where is an data matrix, is a symmetric matrix which contains the variables of interest, and

is a loss function

. In many cases, we assume that is convex in . In reality, there exists a specific kind of nodes called hubs , which are connected to a large number of other nodes. Tan et al. (2014) propose a method called Hub Graphical Lasso (HGL) to estimate the graph with hubs . They decompose as , where represents conditional dependency between non-hub nodes and represents conditional dependency between hub nodes and other nodes. Then they use Hub Penalty Function which is the minimal of the term below with respect to such that instead of the penalty in the above optimization problem

 λ1∥∥Z−diag(Z)∥∥1+λ2∥∥V−diag(V)∥∥1+λ3∥∥(V−diag(V))j∥∥q.

Since HGL does not require prior information of hubs and the number of hubs found by HGL is always not more than that of true hubs, we hope to advance HGL by providing prior information of hubs if we have some beforehand. We introduce Discriminated Hub Penalty Function with the form , which is the minimal of the term below with respect to such that

 {λ1∥Z−diag(Z)∥1+λ2∑j∉D∥(V−diag(V))j∥1+λ3∑j∉D∥(V−diag(V))j∥q +λ4∑j∈D∥(V−diag(V))j∥1+λ5∑j∈D∥(V−% diag(V))j∥q}. (1)

Here contains prior information of hubs and we impose constraints that . In the penalty, we give “loose conditions” for nodes in so that they tend to be identified as hubs in our estimate. We will show that DHGL always performs better than HGL and keeps stable even when wrong prior information is provided.

## 2 Methology

### 2.1 Discriminated Hub Graphical Lasso

#### 2.1.1 Optimization Problem

We continue the definitions and notations in Introduction. Combining Discriminated Hub Penalty Function with the loss function, we have the convex optimization problem

 % minimizeΘ∈S,V,Z {ℓ(X,Θ)+λ1∥Z−diag(Z)∥1+λ2∑j∉D∥(V−diag(V))j∥1 +λ3∑j∉D∥(V−diag(V))j∥q+λ4∑j∈D∥(V−% diag(V))j∥1 +λ5∑j∈D∥(V−diag(V))j∥q}; subject toΘ=V+VT+Z,

where depends on the loss function .

Similar to that in Tan et al. (2014), we encourage the solution of to be a sparse symmetric matrix, and to be a matrix with columns either entirely zero or almost entirely non-zero. Here controls the sparsity in . For variables in , controls the hub nodes selection, and controls the sparsity of each hub’s connections to other nodes. Similar for and for variables not in . Here we set , same as Tan et al. (2014) [2, 3].

As we can see, the “discrimination” involves using different tuning parameters controlling hub selection and hub sparsity for different columns in , or different variables (variables in vs. not in ). When , it reduces to the convex optimization problem corresponding to Hub Penalty Function in Tan et al. (2014). When , it reduces to the classical way to obtain a sparse graph estimate. Generally, contains variables to be less penalized with and .

When , we simply set , where is the empirical covariance matrix of . Then the Discriminated Hub Graphical Lasso (DHGL) optimization problem is

 minimizeΘ∈S {−logdetΘ+trace(SΘ)+P(Θ)}, (2)

with . Again, when , it reduces to Hub Graphical Lasso (HGL) in Tan et al. (2014), and when , it reduces to the classical Graphical Lasso (GL) .

#### 2.1.2 Computational Complexity of ADMM

We can use Alternating Direction Method of Multipliers (ADMM) to solve the convex optimization problems proposed above . The algorithm details as well as derivations are very similar to those of Algorithm 1 in Tan et al. (2014) . Details are displayed in Supplement Section 2.

Also notice that the computational complexity is per iteration for solving DHGL, same in magnitude as HGL. We have also performed a simulation study to illustrate this. The simulation setup is very similar to that of “DHGL applied with known hubs” in Results and Analysis section except that we change . In the first part, we set and also change related parameters such as , and proportionally. In the second part, we set but fix all other related parameters the same as the case when . Results averaged over 25 replications are displayed in Table 1. We have found that the run time per iteration scale the similar way with regardless of other related parameters such as . Also, the number of iterations increases as increase. Moreover, HGL and DHGL performs similarly in terms of run time and number of iterations.

### 2.2 Tuning Parameter Selection

We can select tuning parameters by minimizing the same BIC-type quantity as proposed by Tan et al. (2014)

 BIC(^Θ,^V,^Z)=−n⋅logdet(^Θ)+n⋅trace(S^Θ)+log(n)⋅|^Z|+log(n)⋅(ν+c⋅[|^V|−ν]),

where the number of estimated hub nodes , and is a constant . Same as Tan et al. (2014), here we choose for our simulation experiments.

In order for both optimal solutions for and , i.e., and to be non-diagonal, we can select tuning parameters such that

 λ22+λ32(p−1)1s≤λ1≤λ2+λ32,λ4≤λ2,λ5≤λ3,1s+1q=1.

Detailed proof is displayed in Supplement Section 1.

### 2.3 Measurement of Performance

The precision matrix determines the conditional dependency between nodes . For , we know that if , a true edge between and exists. If further , this is also a hub edge. Then we set a tolerance value (in our experiment, we use ). If , an edge between and is identified. The estimated hubs are defined by the set of nodes that have at least edges. Then we can define the number of correctly estimated edges, the proportion of correctly estimated hub edges, the proportion of correctly estimated hub nodes, the sum of squared errors between and simply by their names, which are also nearly the same as those defined in Tan et al. (2014) . We have also defined the accuracy rate of hubs by the proportion of all nodes that are correctly estimated as hub nodes or correctly estimated as non-hub nodes.

### 2.4 Obtain Set D

#### 2.4.1 DHGL Applied with Known Hubs

Tan et al. (2014) propose HGL assuming they do not know a priori which nodes are hubs . However, before estimating the dependency structures among different nodes, we sometimes have domain knowledge of some dependency. In this case, we hope to use the prior information of hubs to estimate more accurately. In this section, we give an algorithm using DHGL to estimate when some hubs have been known.

In HGL, when and become smaller, it is more likely to find more edges and hubs since all of the entries in have less penalty and tend to become larger. However, while we find more correct edges and hubs as and become smaller, the number of wrong edges and hubs we find also increase. Thus, if we have known some true hubs or some prior information of nodes which are likely to be hubs, we hope to give “loose conditions” only for the entries of their columns of to become larger rather than all of the nodes. That’s why we let contain the prior information and propose and which are not greater than and respectively. If a known hub does not belong to which means the corresponding column of contains many zeros, some entries of the corresponding column of will become larger and more correct edges will be identified. On the other hand, if a node belongs to which means the corresponding column of contains many nonzeros and we still set it in , the corresponding column of will be overamplified which may even become worse than . Thus, we set to avoid this problem. Furthermore, while the algorithm finds more correct edges of the discriminated nodes, a few correct edges found by HGL may disappear so that some nodes in may not belong to . Thus, we need to combine and so that we will not lose useful information of hubs.

#### 2.4.2 DHGL Applied without Known Hubs

In the previous section, we discuss the application of DHGL when some hubs are known. However, we often do not have any prior information of hubs. In this section, we give an algorithm using GL to provide prior information and DHGL to estimate .

The BIC-type quantity can help us select the tuning parameters to get a relatively accurate estimate of . However, is always not more than the number of true hubs because of the penalty term of the number of hubs in the BIC-type quantity, especially when is relatively large. Based on this fact, we try to find extra prior information of hubs which are not in . Since GL is fast (less than for sparse ) , we can adjust the regularization parameter of GL from large to small quickly until we get nodes that belong to rather than . These extra nodes are likely to be true hubs if although there is no guarantee for this. Thus, we set the extra nodes in . Since the prior information provided by GL may be wrong, we keep for conservatism and we only need to select using the BIC-type quantity in DHGL. If the extra nodes are true hubs, tends to be small to make the BIC-type quantity small so that more true edges and hubs will be found. If the extra nodes are not true hubs, tends to keep the same as and the estimation of will be the same as that in HGL. In a word, the new algorithm tends to accept correct prior information to make improvement and reject false one to keep stable.

## 3 Results and Analysis

In the simulation studies, the precision matrix and the set of indices of hub nodes are given, and we estimate using our approaches displayed above.

### 3.1 DHGL Applied with Known Hubs

In the experiment, firstly, we use the simulation set-up I in Tan et al. (2014) with , , , and suppose we know two true hubs randomly . We fix and consider two cases where and . In each case, ranges from 0.1 to 0.7. In DHGL, for simplicity of the experiment, we do not use the BIC-type quantity to select and but set relatively smaller than and . For every set of tuning parameters, we conduct 50 simulations and average the results. Figure 1 shows the comparison of measures of performance between HGL and DHGL.

From Figure 1, we see that for every set of tuning parameters, the number of correctly estimated edges, the proportion of correctly estimated hub edges and the sum of squared errors perform better in DHGL than those in HGL. For the effective proportion of correctly estimated hubs and the effective accuracy rate of hubs, HGL performs better than DHGL reasonably since the discriminated nodes (known hubs) are not considered in the calculation of the two measures. In order not to lose useful information, we combine the hubs found by HGL and DHGL so that we can always get better information of hubs using our method.

Moreover, we consider three other cases for , where , , , and other parameters are set to the same. Figure 2 displays performances of 50 simulations when . All of the three results are similar to the case when .

### 3.2 DHGL Applied without Known Hubs

In the experiment, we use the simulation set-up I in Tan et al. (2014) with , , and . We fix , and select from using the BIC-type quantity. Then for prior information screening, we set and . In DHGL, we select from using the BIC-type quantity with . Figure 3 and 4 display the comparison of 50 simulations between HGL and DHGL when and respectively.

From Figure 3, we see that for which is relatively large, all of the five measures in DHGL outperform those in HGL. Even when the accuracy rate of hubs in HGL equal to 1 (4 of 50) which means the discriminated nodes are not true hubs, all of the performances in DHGL do not degenerate.

From Figure 4, we see that for which is relatively small, the majority of accuracy rates of hubs in HGL equal to 1 (31 of 50). Among these cases, only 5 accuracy rates of hubs in DHGL degenerate and the other measures of the 5 also degenerate slightly. However, when the accuracy rates of hubs in HGL is less than 1 which means , all the measures of the majority (14 of 19) outperform those in HGL obviously.

These two results display both the advantages and robustness of DHGL applied without known hubs. When , DHGL can always perform better than HGL. On the other hand, when which is not always the case, the performance of DHGL seldom degenerates and always keep the same as that of HGL.

## 4 Discussion

Based on Hub Graphical Lasso proposed by Tan et al. (2014), we propose Discriminated Hub Graphical Lasso to estimate the precision matrix when some prior information of hubs is known. For the two situations where some hubs are known or no hubs are known, we propose two algorithms respectively to make some improvement in the measures of performance given by Tan et al. (2014) and us. The improvement essentially results from the increase of the number of tuning parameters for the prior information and the BIC-type quantity proposed by Tan et al. (2014) will be smaller than that in HGL by selecting the optimal parameters. The computational complexity is still per iteration in DHGL, same as that in HGL. However, the two algorithms need us to use both HGL and DHGL which will certainly cost more time but in the same scale. Moreover, in both algorithms, only the prior information of hubs which have not been found by HGL are set in so sometimes we cannot make use of the prior information when . Thus, it remains open problems how to use DHGL only to make improvement and how to use all prior information of hubs.

For Algorithm 2, we do not give a certain method about how to select the optimal a and b when screening prior information using Graphical Lasso. Instead, we select relatively small and for conservatism which will limit the performance of the algorithm. For the future work, we hope to figure out how to select the optimal and , which means the difference between the number of estimated hubs in HGL and the number of true hubs should be studied more deeply. Moreover, we will try to find other methods instead of GL to search for prior information of hubs.

## References

•  Drton M., & Maathuis MH. (2017) Structure Learning in Graphical Modeling. Annual Review of Statistics and Its Application, Vol. 4.
•  Tan KM., London P., Mohan K., Lee S., Fazel M., & Witten D. (2014) Learning Graphical Models With Hubs.

Journal of Machine Learning Research,

15: 3297-3331.
•  Yuan M., & Lin Y. (2007) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49-67.
•  Friedman J., Hastie T., & Tibshirani R. (2007) Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432-441.
•  Boyd S., Parikh N., Chu E., Peleato B., & Eckstein J. (2010) Distributed optimization and statistical learning via the ADMM. Foundations and Trends in Machine Learning, 3(1):1-122.

## Tables and Figures Figure 1: Measures of performance (Number of correctly estimated edges, Proportion of correctly estimated hub edges, Effective proportion of correctly estimated hub nodes, Sum of squared errors between Θ and ^Θ, Effective accuracy rate of hubs) of HGL and DHGL when some hubs are known, where Effective proportion of correctly estimated hub nodes and Effective accuracy rate of hubs are calculated not considering known hubs. The x-axes display λ2 and the y-axes display the five measures. The blue, green, orange, red lines correspond to (λ3=1,HGL),(λ3=1,DHGL),(λ3=1.5,% HGL),(λ3=1.5,DHGL) respectively. 50 simulations are conducted for each set of parameters. Figure 2: Measures of performance (Number of correctly estimated edges, Proportion of correctly estimated hub edges, Effective proportion of correctly estimated hub nodes, Sum of squared errors between Θ and ^Θ, Effective accuracy rate of hubs) of HGL and DHGL when some hubs are known, where Effective proportion of correctly estimated hub nodes and Effective accuracy rate of hubs are calculated not considering known hubs. The x-axes display the 50 simulations and the y-axes display the five measures. The results are for p=300, n=100 and (λ1,λ2,λ3,λ4,λ5)=(0.4,0.4,1,0.2,0.1). The blue and green lines correspond to HGL and DHGL respectively. Figure 3: Measures of performance (Number of correctly estimated edges, Proportion of correctly estimated hub edges, Proportion of correctly estimated hub nodes, Sum of squared errors between Θ and ^Θ, Accuracy rate of hubs) of HGL and DHGL when no hubs are known. The x-axes display the 50 simulations and the y-axes display the five measures. The results are for p=150, n=50 and |H|=10. The blue and green lines correspond to HGL and DHGL respectively. Here, we sort the accuracy rate of hubs of the 50 samples. Figure 4: Measures of performance (Number of correctly estimated edges, Proportion of correctly estimated hub edges, Proportion of correctly estimated hub nodes, Sum of squared errors between Θ and ^Θ, Accuracy rate of hubs) of HGL and DHGL when no hubs are known. The x-axes display the 50 simulations and the y-axes display the five measures. The results are for p=150, n=50 and |H|=5. The blue and green lines correspond to HGL and DHGL respectively. Here, we sort the accuracy rate of hubs of the 50 samples.

## 5 Supplement

### 5.1 Proof of Theorem in Selecting Tunning Parameters

Recall that the optimization problem is

 % minimizeΘ∈S,V,Z {ℓ(X,Θ)+λ1∥Z−diag(Z)∥1+λ2∑j∉D∥(V−diag(V))j∥1 +λ3∑j∉D∥(V−diag(V))j∥q+λ4∑j∈D∥(V−% diag(V))j∥1 +λ5∑j∈D∥(V−diag(V))j∥q} subject to Θ=V+VT+Z, (3)

where the set depends on the loss function .

Let and denote the optimal solution for and in (5.1) and .

Theorem 1: If , then , for .

Proof: Let be the solution to (5.1) and suppose such that .

Let for and for and

Also we construct as follows

 ^Zij={Z⋆ij+V⋆ij+V⋆jifor ∀j∉D,∀i∉D,i≠jZ⋆ijotherwise (4)

Then we have . We now show that has a smaller objective value than , which is a contradiction.

Firstly, we have

 λ1∥∥^Z−diag(^Z)∥∥1+λ2∑j∉D∥∥(^V−diag(^V))j∥∥1+λ4∑j∈D∥∥(^V−diag(^V))j∥∥1+λ3∑j∉D∥∥(^V−diag(^V))j∥∥q+λ5∑j∈D∥∥(^V−diag(^V))j∥∥q=λ1∥∥^Z−diag(^Z)∥∥1+0+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+0+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q≤λ1∥∥Z⋆−diag% (Z⋆)∥∥1+2λ1∑i∉D,j∉D,i≠j|V⋆ij|+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q≤λ1∥∥Z⋆−diag% (Z⋆)∥∥1+2λ1∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q=λ1∥∥Z⋆−diag(Z⋆)∥∥1+λ2∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+(2λ1−λ2)∑j∉D∥∥(V⋆−diag% (V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q.

By Holder’s Inequality , where and . Let , then we have

Hence, .

Hence,

 λ1∥∥^Z−diag(^Z)∥∥1+λ2∑j∉D∥∥(^V−diag(^V))j∥∥1+λ4∑j∈D∥∥(^V−diag(^V))j∥∥1+λ3∑j∉D∥∥(^V−diag(^V))j∥∥q+λ5∑j∈D∥∥(^V−diag(^V))j∥∥q≤λ1∥∥Z⋆−diag% (Z⋆)∥∥1+λ2∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+(2λ1−λ2)(p−1)1s∑j∉D∥∥(V⋆−diag(V⋆))j∥∥q+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q<λ1∥∥Z⋆−diag(Z⋆)∥∥1+λ2∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+λ3∑j∉D∥∥(V⋆−diag(V⋆))j∥∥q+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q.

For the first inequality, we assume . On the other hand, if , the last inequality holds obviously. Thus, this leads to a contradiction.

Theorem 2: If , then is a diagonal matrix.

Proof: Let be the solution to (5.1) and suppose is not a diagonal matrix. Note that is symmetric since . Let and also construct as follows

 ^Vij=⎧⎨⎩V⋆ij+Z⋆ij2i≠jV⋆jjotherwise (5)

Then we have . We now show that has a smaller objective value than which is a contradiction.

Note that

 λ1∥∥^Z−diag(^Z)∥∥1+λ2∑j∉D∥∥(^V−diag(^V))j∥∥1+λ4∑j∈D∥∥(^V−diag(^V))j∥∥1=λ2∑j∉D∥∥(^V−diag(^V))j∥∥1+λ4∑j∈D∥∥(^V−diag(^V))j∥∥1=λ2∑j∉D,i≠j|(V⋆ij+Z⋆ij2)|+λ4∑j∈D,i≠j|(V⋆ij+Z⋆ij2)|≤λ2∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+λ22∑j∉D,i≠j|Z⋆ij|+λ42∑j∈D,i≠j|Z⋆ij|≤λ2∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+λ22∥∥(Z⋆−%diag(Z⋆))∥∥1.

And also,

 λ3∑j∉D∥∥(^V−diag(^V))j∥∥q+λ5∑j∈D∥∥(^V−diag(^V))j∥∥q≤λ3∑j∉D∥∥(V⋆−diag(V⋆))j∥∥q+λ32∑j∉D∥∥(Z⋆−diag(Z⋆))j∥∥q+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q+λ52∑j∈D∥∥(Z⋆−diag(Z⋆))j∥∥q≤λ3∑j∉D∥∥(V⋆−diag(V⋆))j∥∥q+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q+λ32p∑j=1∥∥(Z⋆−diag(Z⋆))j∥∥q≤λ3∑j∉D∥∥(V⋆−diag(V⋆))j∥∥q+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q+λ32∥∥Z⋆−% diag(Z⋆)∥∥1,

where the last inequality follows from the fact that for any vector

and is a non-increasing function of .

Sum up the two inequalities above and we have

 λ1∥∥^Z−diag(^Z)∥∥1+λ2∑j∉D∥∥(^V−diag(^V))j∥∥1+λ4∑j∈D∥∥(^V−diag(^V))j∥∥1+λ3∑j∉D∥∥(^V−diag(^V))j∥∥q+λ5∑j∈D∥∥(^V−diag(^V))j∥∥q≤λ2∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆))j∥∥1+λ3∑j∉D∥∥(V⋆−diag(V⋆))j∥∥q+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q+λ2+λ32∥∥Z⋆−diag(Z⋆)∥∥1<λ2∑j∉D∥∥(V⋆−diag(V⋆))j∥∥1+λ4∑j∈D∥∥(V⋆−diag(V⋆)j∥∥1+λ3∑j∉D∥∥(V⋆−diag(V⋆))j∥∥q+λ5∑j∈D∥∥(V⋆−diag(V⋆))j∥∥q+λ1∥∥Z⋆−diag(Z⋆)∥∥1,

From the two theorems above, here comes a corollary.

Corollary: Under constraints and , if is non-diagonal and , such that , then

 λ22+λ32(p−1)1s≤λ1≤λ2+λ32.

### 5.2 ADMM Algorithm for DHGL

Recall the convex optimization problem (5.1).

Very similar to the work by Tan et al. (2014), let , ,

and

 g(~B)={0if ~Θ=~V+~VT+~Z∞otherwise.

Then, we can rewrite (5.1) as

 minimize B,~B{f(B)+g(~B)}subject to B=~B. (6)

Firstly, we derive the scaled augmented Lagrangian for (6)

 L