A Fast and Scalable Joint Estimator for Integrating Additional Knowledge in Learning Multiple Related Sparse Gaussian Graphical Models

06/01/2018
by   Beilun Wang, et al.

We consider the problem of including additional knowledge in estimating sparse Gaussian graphical models (sGGMs) from aggregated samples, arising often in bioinformatics and neuroimaging applications. Previous joint sGGM estimators either fail to use existing knowledge or cannot scale up to many tasks (large K) under a high-dimensional (large p) situation. In this paper, we propose a novel Joint Elementary Estimator incorporating additional Knowledge (JEEK) to infer multiple related sparse Gaussian graphical models from large-scale heterogeneous data. Using domain knowledge as weights, we design a novel hybrid norm as the minimization objective to enforce the superposition of two weighted sparsity constraints, one on the shared interactions and the other on the task-specific structural patterns. This enables JEEK to elegantly consider various forms of existing knowledge based on the domain at hand and avoids the need to design knowledge-specific optimization. JEEK is solved through a fast and entry-wise parallelizable solution that largely improves the computational efficiency of the state-of-the-art from O(p^5 K^4) to O(p^2 K^4). We conduct a rigorous statistical analysis showing that JEEK achieves the same convergence rate O(log(Kp)/n_tot) as the state-of-the-art estimators that are much harder to compute. Empirically, on multiple synthetic datasets and two real-world datasets, JEEK significantly outperforms the speed of the state-of-the-art methods while achieving the same level of prediction accuracy.


1 Introduction

Technology revolutions in the past decade have collected large-scale heterogeneous samples from many scientific domains. For instance, genomic technologies have delivered petabytes of molecular measurements across hundreds of types of cells and tissues from national projects like ENCODE (Consortium et al., 2012) and TCGA (Network et al., 2011). Neuroimaging technologies have generated petabytes of functional magnetic resonance imaging (fMRI) datasets across thousands of human subjects (shared publicly through projects like openfMRI (Poldrack et al., 2013)). Given such data, understanding and quantifying variable graphs from heterogeneous samples about multiple contexts is a fundamental analysis task.

Such variable graphs can significantly simplify network-driven studies about diseases (Ideker & Krogan, 2012), can help understand the neural characteristics underlying clinical disorders (Uddin et al., 2013) and can allow for understanding genetic or neural pathways and systems. The number of contexts (denoted as K) that those applications need to consider grows extremely fast, ranging from tens (e.g., cancer types in TCGA (Network et al., 2011)) to thousands (e.g., number of subjects in openfMRI (Poldrack et al., 2013)). The number of variables (denoted as p) ranges from hundreds (e.g., number of brain regions) to tens of thousands (e.g., number of human genes).

The above data analysis problem can be formulated as jointly estimating K conditional dependency graphs on a single set of p variables based on heterogeneous samples accumulated from K distinct contexts. For homogeneous data samples from a given i-th context, one typical approach is the sparse Gaussian Graphical Model (sGGM) (Lauritzen, 1996; Yuan & Lin, 2007). sGGM assumes the samples are independently and identically drawn from N(μ^(i), Σ^(i)), a multivariate Gaussian distribution with mean vector μ^(i) and covariance matrix Σ^(i). The graph structure is encoded by the sparsity pattern of the inverse covariance matrix, also named the precision matrix, Ω^(i) = (Σ^(i))^{-1}: Ω^(i)_{j,k} = 0 if and only if no edge connects the j-th node and the k-th node (i.e., they are conditionally independent). sGGM imposes an ℓ1 penalty on the parameter Ω^(i) to achieve a consistent estimation under high-dimensional situations. When handling heterogeneous data samples, rather than estimating the sGGM of each condition separately, a multi-task formulation that jointly estimates the K different but related sGGMs can lead to better generalization (Caruana, 1997).

Previous studies for joint estimation of multiple sGGMs roughly fall into four categories (Danaher et al., 2013; Mohan et al., 2013; Chiquet et al., 2011; Honorio & Samaras, 2010; Guo et al., 2011; Zhang & Wang, 2012; Zhang & Schneider, 2010; Zhu et al., 2014): (1) The first group seeks to optimize a sparsity-regularized data likelihood function plus an extra penalty function to enforce structural similarity among the multiple estimated networks. The joint graphical lasso (JGL) (Danaher et al., 2013) proposed an alternating direction method of multipliers (ADMM) based optimization algorithm to work with two such regularization functions (the fused and the group graphical lasso penalties). (2) The second category tries to recover the support of Ω^(i) using sparsity-penalized regressions in a column-by-column fashion. Recently (Monti et al., 2015) proposed to learn population and subject-specific brain connectivity networks via a so-called "Mixed Neighborhood Selection" (MNS) method in this category. (3) The third type of method seeks to minimize the joint sparsity of the target precision matrices under matrix inversion constraints. One recent study, named SIMULE (Shared and Individual parts of MULtiple graphs Explicitly) (Wang et al., 2017b), automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts (i.e., by modeling each precision matrix as the sum of an individual part and a shared part) via constrained ℓ1 minimization. Following the CLIME estimator (Pang et al., 2014), the constrained convex formulation can be solved column by column via linear programming. However, all three categories of the aforementioned estimators are difficult to scale up when the dimension p or the number of tasks K is large, because they cannot avoid expensive steps like SVD (Danaher et al., 2013) for JGL, linear programming for SIMULE, or running multiple iterations of expensive penalized regressions for MNS. (4) The last category extends the so-called "Elementary Estimator" graphical model (EE-GM) formulation (Yang et al., 2014b) to revise JGL's penalized likelihood into a constrained convex program that minimizes a superposition of sparsity- and structure-enforcing norms. One proposed estimator, FASJEM (Wang et al., 2017a), is solved in an entry-wise and group-entry-wise manner that largely outperforms the speed of its JGL counterparts. More details of the related works are in Section (5).

One significant caveat of state-of-the-art joint sGGM estimators is that little attention has been paid to incorporating existing knowledge of the nodes or of the relationships among nodes into the models. In addition to the samples themselves, such additional information is widely available in real-world applications, and incorporating it is of great scientific interest. A prime example: when estimating the functional brain connectivity networks among brain regions from fMRI samples, the spatial positions of the regions are readily available. Neuroscientists have gathered considerable knowledge regarding the spatial and anatomical evidence underlying brain connectivity (e.g., short edges and certain anatomical regions are more likely to be connected (Watts & Strogatz, 1998)). Another important example is the problem of identifying gene-gene interactions from patients' gene expression profiles across multiple cancer types. Learning the statistical dependencies among genes from such heterogeneous datasets can help to understand how such dependencies vary from normal to abnormal states and help to discover contributing markers that influence or cause the diseases. Besides the patient samples, state-of-the-art bio-databases like HPRD (Prasad et al., 2009) have collected a significant amount of information about direct physical interactions among corresponding proteins, regulatory gene pairs, and signaling relationships collected from high-quality bio-experiments.

Although such information provides strong evidence about the structural patterns we aim to discover, it has rarely been considered in the joint sGGM formulation of such samples. To the authors' best knowledge, only one study, named W-SIMULE, tried to extend the constrained ℓ1 minimization in SIMULE into a weighted ℓ1 formulation for considering spatial information of brain regions in the joint discovery of heterogeneous neural connectivity graphs (Singh et al., 2017). This method was designed just for neuroimaging samples and has an O(p^5 K^4) time cost, making it not scalable to large-scale settings (more details in Section 3).

This paper aims to fill this gap by adding additional knowledge most effectively into scalable and fast joint sGGM estimation. We propose a novel model, namely the Joint Elementary Estimator incorporating additional Knowledge (JEEK), that presents a principled and scalable strategy to include additional knowledge when estimating multiple related sGGMs jointly. Briefly speaking, this paper makes the following contributions:


  • Novel approach: JEEK presents a new way of integrating additional knowledge in learning multi-task sGGMs in a scalable way. (Section 3)

  • Fast optimization: We optimize JEEK in an entry-wise and group-entry-wise manner that dramatically improves the time complexity to O(p^2 K^4). (Section 3.4)

  • Convergence rate: We theoretically prove the convergence rate of JEEK as O(log(Kp)/n_tot). This rate shows the benefit of joint estimation and matches the convergence rate of state-of-the-art estimators that are much harder to compute. (Section 4)

  • Evaluation: We evaluate JEEK using several synthetic datasets and two real-world datasets, one from neuroscience and one from genomics. It significantly outperforms state-of-the-art baselines in terms of speed. (Section 6)

JEEK provides the flexibility of using (K+1) different weight matrices representing the extra knowledge. We showcase a few possible designs of the weight matrices in Section 12 (a small illustrative sketch follows the list below), including (but not limited to):

  • Spatial or anatomy knowledge about brain regions;

  • Knowledge of known co-hub nodes or perturbed nodes;

  • Known group information about nodes, such as genes belonging to the same biological pathway or cellular location;

  • Using existing known edges as the knowledge, like the known protein interaction databases for discovering gene networks (a semi-supervised setting for such estimations).
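As a concrete illustration of the flexibility listed above, here is a minimal sketch of how two of these knowledge forms could be turned into positive weight matrices; the scaling constants and function names are our own assumptions, not the specific designs used in Section 12.

import numpy as np

def distance_weight_matrix(coords, scale=1.0, eps=1e-3):
    """Spatial knowledge: node pairs that are far apart get larger weights,
    i.e., their edges are penalized more heavily (illustrative scaling)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    W = scale * d + eps              # strictly positive, as the kw-norm requires
    np.fill_diagonal(W, eps)
    return W

def known_edge_weight_matrix(p, known_edges, low=0.1, high=1.0):
    """Edge knowledge (e.g., a protein-interaction database): pairs known
    to interact receive a small weight, i.e., a weak penalty."""
    W = np.full((p, p), high)
    for j, k in known_edges:
        W[j, k] = W[k, j] = low
    np.fill_diagonal(W, low)
    return W

coords = np.random.rand(10, 3)                   # e.g., 10 brain regions in 3-D
W_spatial = distance_weight_matrix(coords)
W_edges = known_edge_weight_matrix(10, [(0, 1), (2, 5)])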

We sincerely believe the scalability and flexibility provided by JEEK can make structure learning of joint sGGM feasible in many real-world tasks.

Note: Due to space limitations, we have put details of certain contents (e.g., proofs) in the appendix. Notations with "S:" as the prefix in the numbering mean the corresponding contents are in the appendix. For example, full proofs are in Section (10).

Notations: the math notations we use are described in Section (8). n_tot denotes the total number of data samples across all contexts. The Hadamard product ∘ is the element-wise product between two matrices. Also, to simplify notation, we abuse the division notation to denote element-wise division; in particular, 1/W represents the matrix generated by taking the element-wise reciprocal of each entry of W.
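A small numpy illustration of these notational conventions (variable names are ours):

import numpy as np

W = np.array([[1.0, 2.0], [4.0, 0.5]])
A = np.array([[3.0, -1.0], [0.0, 2.0]])

hadamard = W * A                       # (W ∘ A)_{jk} = W_{jk} * A_{jk}
inv_W = 1.0 / W                        # the matrix written as 1/W: element-wise reciprocal
weighted_l1 = np.abs(W * A).sum()      # ‖W ∘ A‖_1, the weighted sparsity penalty
weighted_linf = np.abs(A / W).max()    # ‖(1/W) ∘ A‖_∞, the dual weighting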

2 Background

Sparse Gaussian graphical model (sGGM):

The classic formulation of estimating sparse Gaussian Graphical model (Yuan & Lin, 2007) from a single given condition (single sGGM) is the “graphical lasso” estimator (GLasso) (Yuan & Lin, 2007; Banerjee et al., 2008). It solves the following penalized maximum likelihood estimation (MLE) problem:

Ω̂ = argmin_{Ω≻0}  −log det(Ω) + ⟨Ω, Σ̂⟩ + λ_n ‖Ω‖_1      (1)
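For reference, this single-task baseline is available in standard libraries; a minimal sketch using scikit-learn's GraphicalLasso solver for the penalized MLE in Eq. (1) (the data here are random and purely illustrative, and this is the baseline estimator, not the method proposed in this paper):

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))        # n = 200 samples, p = 10 variables

model = GraphicalLasso(alpha=0.1)         # alpha plays the role of lambda_n in Eq. (1)
model.fit(X)
Omega_hat = model.precision_              # estimated sparse precision matrix
n_edges = (np.abs(Omega_hat[np.triu_indices(10, k=1)]) > 1e-8).sum()
print("estimated number of edges:", n_edges)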
M-Estimator with Decomposable Regularizer in High-Dimensional Situations:

Recently the seminal study (Negahban et al., 2009) proposed a unified framework for high-dimensional analysis of the following general formulation: M-estimators with decomposable regularizers:

argmin_θ  L(θ) + λ_n R(θ)      (2)

where R(·) represents a decomposable regularization function and L(·) represents a loss function (e.g., the negative log-likelihood function in sGGM). Here λ_n is the tuning parameter.

Elementary Estimators (EE):

Using the analysis framework from  (Negahban et al., 2009), recent studies (Yang et al., 2014a, b, c) propose a new category of estimators named “Elementary estimator” (EE) with the following general formulation:

argmin_θ  R(θ)   subject to  R*(θ − θ̂_n) ≤ λ_n      (3)

where R*(·) is the dual norm of R(·), defined as:

R*(v) := sup_{u≠0} ⟨u, v⟩ / R(u)      (4)

The solution of Eq. (3) achieves a near-optimal convergence rate compared to Eq. (2) when certain conditions are satisfied. R(·) represents a decomposable regularization function (e.g., the ℓ1-norm) and R*(·) is the dual norm of R(·) (e.g., the ℓ∞-norm is the dual norm of the ℓ1-norm). λ_n is a regularization parameter.

The basic motivation of Eq. (3) is to build simpler and possibly faster estimators that nonetheless come with statistical guarantees comparable to regularized MLE. θ̂_n needs to be carefully constructed, well-defined, and closed-form for the purpose of simpler computation. The formulation defined by Eq. (3) ensures that its solution has the desired structure defined by R(·). For high-dimensional estimation of linear regression models, θ̂_n can be the classical ridge estimator, which is itself closed-form and comes with strong statistical convergence guarantees in high-dimensional situations.

EE-sGGM:

(Yang et al., 2014b) proposed elementary estimators for graphical models (GM) of exponential families, in which θ̂_n represents a so-called proxy of the backward mapping for the target GM (more details in Section 11). The key idea (summarized in the upper row of Figure 1) is to investigate the vanilla MLE and where it "breaks down" when estimating a graphical model of exponential families in high dimensions (Yang et al., 2014b).

Essentially, the vanilla graphical model MLE can be expressed as a backward mapping that computes the model parameters from some given moments of an exponential family distribution. For instance, in the case of learning a Gaussian GM (GGM) with vanilla MLE, the backward mapping is Σ̂^{-1}, which estimates Ω from the sample covariance matrix (moment) Σ̂. We introduce the details of backward mapping in Section 11.

However, even though this backward mapping has a simple closed form for GGMs, it is normally not well-defined in high-dimensional settings. When given the sample covariance Σ̂, we cannot just compute the vanilla MLE solution as Σ̂^{-1} for a GGM, since Σ̂ is rank-deficient when p > n. Therefore Yang et al. (Yang et al., 2014b) used a carefully constructed proxy backward map [T_v(Σ̂)]^{-1} that is both available in closed form and well-defined in high-dimensional settings for GGMs. We introduce the details of [T_v(Σ̂)]^{-1} and its statistical properties in Section 11. Now Eq. (3) becomes the following closed-form estimator for learning sparse Gaussian graphical models (Yang et al., 2014b):

argmin_Ω  ‖Ω‖_{1,off}   subject to  ‖Ω − [T_v(Σ̂)]^{-1}‖_{∞,off} ≤ λ_n      (5)

Eq. (5) is a special case of Eq. (3), in which R(·) is the off-diagonal ℓ1-norm and the precision matrix Ω is the parameter θ we search for. When R(·) is the ℓ1-norm, the solution of Eq. (3) (and Eq. (5)) just needs to perform entry-wise (soft-)thresholding operations on [T_v(Σ̂)]^{-1} to ensure the desired sparsity structure of its final solution.
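To make the two-step structure of EE-sGGM concrete, here is a minimal numpy sketch, assuming element-wise soft-thresholding both for T_v (applied off-diagonally to the sample covariance) and for the final ℓ1 projection; the tuning constants are illustrative and the function names are ours.

import numpy as np

def soft_threshold(A, t):
    """Element-wise soft-thresholding that leaves the diagonal untouched."""
    out = np.sign(A) * np.maximum(np.abs(A) - t, 0.0)
    out[np.diag_indices_from(A)] = np.diag(A)
    return out

def ee_sggm(X, v=0.2, lam=0.1):
    """Sketch of the elementary estimator for a single sGGM (Eq. (5)):
    (1) proxy backward mapping [T_v(Sigma_hat)]^{-1},
    (2) entry-wise soft-thresholding, which is the closed-form solution
        of the l1-constrained problem."""
    Sigma_hat = np.cov(X, rowvar=False)
    B = np.linalg.inv(soft_threshold(Sigma_hat, v))    # proxy backward mapping
    return soft_threshold(B, lam)                      # closed-form EE solution

rng = np.random.default_rng(0)
Omega_hat = ee_sggm(rng.standard_normal((150, 20)))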

Figure 1: Basic idea of JEEK.

3 Proposed Method: JEEK

In applications of Gaussian graphical models, we typically have more information than just the data samples themselves. This paper aims to propose a simple, scalable and theoretically-guaranteed joint estimator for estimating multiple sGGMs with additional knowledge in large-scale situations.

3.1 A Joint EE (JEE) Formulation

We first propose to jointly estimate multiple related sGGMs from K data blocks using the following formulation:

argmin_{Ω^(i)≻0}  Σ_{i=1}^{K} L(Ω^(i)) + λ_n R(Ω^(1), …, Ω^(K))      (6)

where Ω^(i) denotes the precision matrix for the i-th task, L(·) describes the negative log-likelihood function of the sGGM, Ω^(i) ≻ 0 means that Ω^(i) needs to be a positive definite matrix, and R(·) represents a decomposable regularization function enforcing sparsity and structure assumptions (details in Section (3.2)).

For ease of notation, we denote Ω_tot = (Ω^(1), …, Ω^(K)) and Σ̂_tot = (Σ̂^(1), …, Σ̂^(K)); both stack K matrices of size p×p, and Ω_tot collects the parameters to estimate. Now define an inverse function inv(·) as inv(A_tot) := ((A^(1))^{-1}, …, (A^(K))^{-1}), where A_tot is a given matrix set with the same structure as Ω_tot. Then we rewrite Eq. (6) into the following form:

argmin_{Ω_tot}  L(Ω_tot) + λ_n R(Ω_tot)      (7)

Now connecting Eq. (7) to Eq. (2) and Eq. (3), we propose the following joint elementary estimator (JEE) for learning multiple sGGMs:

argmin_{Ω_tot}  R(Ω_tot)   subject to  R*(Ω_tot − inv(Σ̂_tot)) ≤ λ_n      (8)

The fundamental component in Eq. (3) for the single-context sGGM was to use a well-defined proxy function [T_v(Σ̂)]^{-1} to approximate the vanilla MLE solution (named the backward mapping for exponential-family distributions) (Yang et al., 2014b). The proposed proxy is both well-defined under high-dimensional situations and has a simple closed form. Following a similar idea, when learning multiple sGGMs, we propose to use [T_v(Σ̂^(i))]^{-1} for the i-th task and get the following joint elementary estimator:

argmin_{Ω_tot}  R(Ω_tot)   subject to  R*(Ω_tot − inv(T_v(Σ̂_tot))) ≤ λ_n      (9)

3.2 Knowledge as Weight (KW-Norm)

The main goal of this paper is to design a principled strategy to incorporate existing knowledge (other than samples or structured assumptions) into the multi-sGGM formulation. We consider two factors in such a design:

(1) When learning multiple sGGMs jointly in real-world applications, it is often of great scientific interest to model and learn context-specific graph variations explicitly, because such variations can "fingerprint" important markers in domains like cognition (Ideker & Krogan, 2012) or pathology (Kelly et al., 2012). Therefore we design Ω_tot to share parameters among different contexts. Mathematically, we model each Ω^(i) as two parts:

Ω^(i) = Ω_I^(i) + Ω_S      (10)

where Ω_I^(i) is the individual precision matrix for context i and Ω_S is the shared precision matrix across contexts. Again, for ease of notation we denote Ω_I = (Ω_I^(1), …, Ω_I^(K)) and write the full set of parameters to estimate as (Ω_I, Ω_S).

(2) We represent additional knowledge as positive weight matrices in R^{p×p}. More specifically, we represent the knowledge about the i-th task-specific graph as a weight matrix W_I^(i), and W_S represents existing knowledge of the shared network. The positive-matrix-based representation is a powerful and flexible strategy that can describe many possible forms of existing knowledge. In Section (12), we provide a few different designs of W_I^(i) and W_S for real-world applications. In total, we have (K+1) weight matrices to represent additional knowledge. To simplify notation, we denote W_I = (W_I^(1), …, W_I^(K)) and W_tot = (W_I^(1), …, W_I^(K), W_S).

Now we propose the following knowledge as weight norm (kw-norm) combining the above two:

R(Ω_tot) = Σ_{i=1}^{K} ‖W_I^(i) ∘ Ω_I^(i)‖_1 + ‖W_S ∘ Ω_S‖_1      (11)

Here the Hadamard product ∘ is the element-wise product between two matrices, i.e., (A ∘ B)_{j,k} = A_{j,k} B_{j,k}.
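A small numpy sketch evaluating this kw-norm for given weight matrices and a candidate superposition (the exact placement of the shared-part weights follows our reconstruction of Eq. (11) above and should be read as an assumption):

import numpy as np

def weighted_l1(W, A):
    """‖W ∘ A‖_1: existing knowledge enters only through the weights W."""
    return np.abs(W * A).sum()

def kw_norm(W_I, W_S, Omega_I, Omega_S):
    """kw-norm of the superposition: weighted sparsity on each task-specific
    part plus weighted sparsity on the shared part (sketch of Eq. (11))."""
    individual = sum(weighted_l1(Wi, Oi) for Wi, Oi in zip(W_I, Omega_I))
    return individual + weighted_l1(W_S, Omega_S)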

The kw-norm( Eq. (11)) has the following three properties:

  • (i) kw-norm is a norm function if and only if no entry of W_I^(i) (for any i) or W_S equals 0.

  • (ii) If the condition in (i) holds, kw-norm is a decomposable norm.

  • (iii) If the condition in (i) holds, the dual norm of the kw-norm is R*(u_tot) = max_i max( ‖(1/W_I^(i)) ∘ u^(i)‖_∞, ‖(1/W_S) ∘ u^(i)‖_∞ ). (Footnote: in a previous version of this paper we mistakenly wrote a different expression for this dual norm; we correct the relevant equations here. Our code implementation was correct and is thus unchanged.)

Section 10.1 provides proofs of the above claims.

3.3 JEE with Knowledge (JEEK)

Plugging Eq. (11) into Eq. (9), we obtain the following formulation of JEEK for learning multiple related sGGMs from heterogeneous samples:

argmin_{Ω_I, Ω_S}  Σ_{i=1}^{K} ‖W_I^(i) ∘ Ω_I^(i)‖_1 + ‖W_S ∘ Ω_S‖_1
subject to:  ‖(1/W_I^(i)) ∘ (Ω_I^(i) + Ω_S − [T_v(Σ̂^(i))]^{-1})‖_∞ ≤ λ_n,  i = 1, …, K
             ‖(1/W_S) ∘ (Ω_I^(i) + Ω_S − [T_v(Σ̂^(i))]^{-1})‖_∞ ≤ λ_n,  i = 1, …, K      (12)

In Section 4, we theoretically prove that the statistical convergence rate of JEEK achieves the same sharp convergence rate as the state-of-the-art estimators for multi-task sGGMs. Our proofs are inspired by the unified framework of high-dimensional statistics (Negahban et al., 2009).

3.4 Solution of JEEK:

A huge computational advantage of JEEK (Eq. (12)) is that it can be decomposed into independent small linear programming problems. To simplify notation, for a fixed entry position (j,k) we denote the (j,k)-th entry of Ω_I^(i) as a_i and the (j,k)-th entry of Ω_S as a_S. Similarly we denote (W_I^(i))_{j,k} as w_i, (W_S)_{j,k} as w_S, and the (j,k)-th entry of the proxy backward mapping [T_v(Σ̂^(i))]^{-1} as r_i. "A group of entries" means the set of parameters {a_1, …, a_K, a_S} for a certain (j,k).

In order to estimate Ω_I and Ω_S, JEEK (Eq. (12)) can be decomposed into the following formulation for a certain (j,k):

argmin_{a_1,…,a_K, a_S}  Σ_{i=1}^{K} w_i |a_i| + w_S |a_S|
subject to:  |a_i + a_S − r_i| ≤ λ_n · min(w_i, w_S),  i = 1, …, K      (13)

Eq. (13) can easily be converted into the linear programming form of Eq. (8.1) with only O(K) variables. The time complexity of solving Eq. (13) is O(K^4). Considering that JEEK has a total of p^2 such subproblems to solve, the computational complexity of JEEK (Eq. (12)) is therefore O(p^2 K^4). We summarize the optimization algorithm of JEEK in Algorithm 1 (details in Section (8.2)).

Input: Data sample matrices X^(i) (i = 1 to K), regularization hyperparameter λ_n, knowledge weight matrices W_I^(i) (i = 1 to K) and W_S, and LP(·) (a linear programming solver)

Output: Ω_I^(i) and Ω_S (i = 1 to K)

1:  for i = 1 to K do
2:      Initialize Σ̂^(i) (the sample covariance matrix of X^(i))
3:      Initialize T_v(Σ̂^(i))
4:      Calculate the proxy backward mapping [T_v(Σ̂^(i))]^{-1}
5:  end for
6:  for j = 1 to p do
7:      for k = 1 to p do
8:          Collect the proxy entries r_i = ([T_v(Σ̂^(i))]^{-1})_{j,k} for i = 1, …, K
9:          Collect the individual weights w_i = (W_I^(i))_{j,k} for i = 1, …, K
10:          Collect the shared weight w_S = (W_S)_{j,k}
11:          (a_1, …, a_K, a_S) = LP(r, w, w_S, λ_n), where LP(·) solves Eq. (13)
12:          for i = 1 to K do
13:              (Ω_I^(i))_{j,k} = a_i
14:              (Ω_S)_{j,k} = a_S
15:              (Ω^(i))_{j,k} = a_i + a_S
16:          end for
17:      end for
18:  end for
Algorithm 1 Joint Elementary Estimator with additional knowledge (JEEK) for Multi-task sGGMs
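To illustrate the entry-wise decomposition, the following sketch solves one entry-group subproblem with scipy's linprog. It assumes the group couples the K task-specific entries a_1, …, a_K and one shared entry a_S through box constraints |a_i + a_S − r_i| ≤ tol_i; exactly how λ_n and the weights combine into these tolerances (and how often the shared weight enters the objective) follows the dual-norm formulation of Eq. (12) and is abstracted into the arguments here. Each |x| is handled with the standard x = x⁺ − x⁻ splitting described in Section 8.1.

import numpy as np
from scipy.optimize import linprog

def solve_entry_group(r, w_I, w_S, tol):
    """One (j, k) entry-group subproblem of the JEEK decomposition (a sketch).

    r   : length-K array of proxy backward-mapping values for this entry
    w_I : length-K array of individual-knowledge weights for this entry
    w_S : scalar shared-knowledge weight for this entry
    tol : length-K array of per-task constraint tolerances

    Variables (all >= 0): a_i^+, a_i^- for each task, then a_S^+, a_S^-.
    """
    K = len(r)
    n_var = 2 * K + 2
    c = np.concatenate([np.repeat(w_I, 2), [w_S, w_S]])   # weighted l1 objective

    A_ub, b_ub = [], []
    for i in range(K):
        row = np.zeros(n_var)
        row[2 * i], row[2 * i + 1] = 1.0, -1.0             # a_i = a_i^+ - a_i^-
        row[-2], row[-1] = 1.0, -1.0                       # a_S = a_S^+ - a_S^-
        A_ub.append(row);  b_ub.append(r[i] + tol[i])      #  a_i + a_S <= r_i + tol_i
        A_ub.append(-row); b_ub.append(tol[i] - r[i])      # -a_i - a_S <= tol_i - r_i
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * n_var, method="highs")
    x = res.x
    return x[0:2 * K:2] - x[1:2 * K:2], x[-2] - x[-1]      # individual entries, shared entry

a, a_S = solve_entry_group(r=np.array([0.8, -0.3]),
                           w_I=np.array([1.0, 1.0]), w_S=0.5,
                           tol=np.array([0.2, 0.2]))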

4 Theoretical Analysis

KW-Norm:

We presented the three properties of kw-norm in Section 3.2. The proofs of these three properties are included in Section (10.1).

Theoretical error bounds of Proxy Backward Mapping:

(Yang et al., 2014b) proved that, with a properly chosen thresholding parameter v, the proxy backward mapping [T_v(Σ̂)]^{-1} used by EE-sGGM achieves a sharp convergence rate to its truth (i.e., by bounding ‖[T_v(Σ̂)]^{-1} − Ω*‖_∞). The proof was extended from a previous study (Rothman et al., 2009) that devised T_v(·) for estimating covariance matrices consistently in high-dimensional situations. See detailed proofs in Section 11.3. To derive the statistical error bound of JEEK, we need to assume that the proxy backward mappings [T_v(Σ̂^(i))]^{-1} are well-defined. This is ensured by assuming that the true covariance matrices Σ^(i) satisfy the conditions defined in Section (10.1).

Theoretical error bounds of JEEK:

We now use the high-dimensional analysis framework from (Negahban et al., 2009), the three properties of the kw-norm, and the error bounds of the proxy backward mapping from (Rothman et al., 2009; Yang et al., 2014b) to derive the statistical convergence rates of JEEK. Detailed proofs of the following theorems are in Section 10.2.

Before providing the theorem, we need to define the structural assumption, IS-Sparsity, that we assume for the true parameter.
(IS-Sparsity): The 'true' parameter Ω*_tot can be decomposed into two clear structures, Ω*_I and Ω*_S. Ω*_I is exactly sparse, with its non-zero entries indexed by a support set S_I, and Ω*_S is exactly sparse, with its non-zero entries indexed by a support set S_S. All other elements equal 0 (i.e., lie in (S_I ∪ S_S)^c).

Consider Ω_tot whose true parameter Ω*_tot satisfies the (IS-Sparsity) assumption. Suppose we compute the solution of Eq. (12) with a bounded λ_n such that λ_n ≥ R*(inv(T_v(Σ̂_tot)) − Ω*_tot); then the estimated solution Ω̂_tot satisfies the following error bounds:

(14)
Proof.

See detailed proof in Section 10.2

Theorem (4) provides a general bound for any selection of λ_n satisfying the above condition. The bound is controlled by the distance between the proxy backward mapping inv(T_v(Σ̂_tot)) and the truth Ω*_tot. We then extend Theorem (4) to derive the statistical convergence rate of JEEK, which gives the following corollary: Suppose the high-dimensional setting, i.e., p > max_i n_i. Then for suitable choices of v and λ_n, with high probability, the estimated optimal solution Ω̂_tot has the following error bound:

(15)

where the remaining quantities are constants.

Proof.

See detailed proof in Section 10.2.2 (especially from  Eq. (42) to Eq. (50)). ∎

Bayesian View of JEEK:

In Section (9) we provide a direct Bayesian interpretation of JEEK through the perspective of hierarchical Bayesian modeling. Our hierarchical Bayesian interpretation nicely explains the assumptions we make in JEEK.

5 Connecting to Relevant Studies

JEEK is closely related to a few state-of-the-art studies summarized in Table 1. We compare the time complexity and functional properties of JEEK versus these studies.

Method               | JEEK          | W-SIMULE      | JGL | FASJEM | NAK (run K times)
Time Complexity      | O(p^2 K^4)    | O(p^5 K^4)    |     |        |
Additional Knowledge | YES           | YES           | NO  | NO     | YES
Table 1: Comparison of JEEK versus the baselines. The time-complexity entries for JGL, FASJEM, and NAK depend on the number of iterations of their iterative solvers and are discussed in the text; JEEK further reduces to per-entry-group cost if parallelizing completely.
NAK: (Bu & Lederer, 2017)

For the single-task sGGM, one recent study (Bu & Lederer, 2017) (following ideas from (Shimamura et al., 2007)) proposed to integrate additional knowledge into the estimation of graphical models through a weighted neighbourhood-selection formulation (NAK), i.e., a column-by-column lasso regression in which each coefficient carries its own knowledge-derived penalty weight. NAK is designed for estimating brain connectivity networks from homogeneous samples and incorporates distance knowledge as weight vectors. (Footnote: here the regression coefficients of the j-th column indicate the sparsity of the j-th column of a single Ω; a coefficient is non-zero if and only if the corresponding precision entry is non-zero, and the weight vector serves as the additional knowledge.) The NAK formulation can be solved by a classic Lasso solver like glmnet. In our experiments, we compare JEEK to NAK (by running the NAK R package K times) on multiple synthetic datasets of simulated samples about brain regions. The data simulation strategy was suggested by (Bu & Lederer, 2017). As in NAK (Bu & Lederer, 2017), we use the spatial distance among brain regions as the additional knowledge in JEEK.
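To illustrate the weighted neighborhood-selection idea (this is not the NAK package, and the helper name is ours), one column's regression with per-coefficient penalty weights can be obtained from a plain lasso solver via the usual column-rescaling trick:

import numpy as np
from sklearn.linear_model import Lasso

def weighted_neighborhood_column(X, j, weights, lam=0.1):
    """Regress variable j on all others with penalty sum_k weights[k] * |beta_k|
    (the weights encode, e.g., spatial distance).  A plain Lasso penalizes all
    coefficients uniformly, so we rescale column k by 1 / weights[k] and undo
    the rescaling afterwards."""
    y = X[:, j]
    others = np.delete(np.arange(X.shape[1]), j)
    w = weights[others]
    Z = X[:, others] / w
    model = Lasso(alpha=lam, fit_intercept=False).fit(Z, y)
    beta = model.coef_ / w
    return others[np.abs(beta) > 1e-8]      # estimated neighborhood of node j

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
neighbors = weighted_neighborhood_column(X, j=0, weights=np.ones(8))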

W-SIMULE: (Singh et al., 2017)

Like JEEK, one recent study (Singh et al., 2017) of multi-sGGMs (following ideas from (Wang et al., 2017b)) also assumed that Ω^(i) = Ω_I^(i) + Ω_S and incorporated spatial distance knowledge in its convex formulation for the joint discovery of heterogeneous neural connectivity graphs. This study, named W-SIMULE (Weighted model for Shared and Individual parts of MULtiple graphs Explicitly), uses a weighted constrained ℓ1 minimization:

argmin_{Ω_I^(i), Ω_S}  Σ_{i=1}^{K} ‖W ∘ Ω_I^(i)‖_1 + εK ‖W ∘ Ω_S‖_1
Subject to:  ‖Σ̂^(i)(Ω_I^(i) + Ω_S) − I‖_∞ ≤ λ_n,  i = 1, …, K      (16)

W-SIMULE simply includes the additional knowledge as a weight matrix W. (Footnote: it can be solved by any linear programming solver and can be parallelized column-wise. However, it is very slow when p is large due to its expensive O(p^5 K^4) computation cost.)

Different from W-SIMULE, JEEK separates the knowledge of each individual context and the shared knowledge using different weight matrices. While W-SIMULE also minimizes a weighted ℓ1 norm, its constrained-optimization term is entirely different from JEEK's. This difference in formulation makes the optimization of JEEK much faster and more scalable than W-SIMULE (Section (6)). We have provided a complete theoretical analysis of the error bounds of JEEK, while W-SIMULE provided no theoretical results. Empirically, we compare JEEK with the W-SIMULE R package from (Singh et al., 2017) in the experiments.

JGL: (Danaher et al., 2013)

Regularized-MLE-based multi-sGGM studies mostly follow the so-called joint graphical lasso (JGL) formulation of Eq. (17):

argmin_{Ω^(i)≻0}  Σ_{i=1}^{K} ( −log det(Ω^(i)) + ⟨Ω^(i), Σ̂^(i)⟩ ) + λ_1 Σ_{i=1}^{K} ‖Ω^(i)‖_1 + λ_2 P(Ω^(1), …, Ω^(K))      (17)

Here P(·) is the second penalty function, enforcing some structural assumption of group similarity among the multiple graphs. One caveat of JGL is that P(·) cannot model explicit additional knowledge; for instance, it cannot incorporate the information of a few known hub nodes shared by the contexts. In our experiments, we compare JEEK to the JGL-co-hub and JGL-perturb-hub toolbox provided by (Mohan et al., 2013).

FASJEM: (Wang et al., 2017a)

One very recent study extended JGL using a so-called elementary superposition-structured moment estimator formulation, as in Eq. (18):

argmin_{Ω_tot}  ‖Ω_tot‖_1 + ε R'(Ω_tot)
subject to:  ‖Ω_tot − inv(T_v(Σ̂_tot))‖_∞ ≤ λ_n,   R'*(Ω_tot − inv(T_v(Σ̂_tot))) ≤ ελ_n      (18)

FASJEM is much faster and more scalable than the JGL estimators. However, like the JGL estimators, it cannot model additional knowledge, and its optimization needs to be carefully re-designed for each choice of R'(·). (Footnote: FASJEM extends JGL into multiple independent group-entry-wise optimizations, just like JEEK. Here R'*(·) is the dual norm of R'(·). Because (Wang et al., 2017a) only designed the optimization for two cases (group-ℓ2 and group-ℓ∞), we cannot use it as a baseline.)

Both NAK and W-SIMULE only explored their formulations for estimating neural connectivity graphs using spatial information as the additional knowledge. In contrast, our experiments (Section (6)) extend the weight-as-knowledge formulation to weights derived from distance, from shared-hub knowledge, from perturbed-hub knowledge, and from nodes' grouping information (e.g., multiple genes known to be in the same pathway). This largely extends the previous studies in demonstrating the real-world adaptivity of the proposed formulation. JEEK elegantly formulates existing knowledge based on the problem at hand and avoids the need to design knowledge-specific optimization.

6 Experiments

We empirically evaluate JEEK and the baselines on four types of datasets, including two groups of synthetic data, one real-world fMRI dataset for brain connectivity estimation, and one real-world genomics dataset for estimating interactions among regulatory genes (results in Section (6.2)). In order to incorporate various types of knowledge, we provide five different designs of the weight matrices in Section 12. Details of the experimental setup, metrics, and hyper-parameter tuning are included in Section (13.1). The baselines used in our experiments are explained in detail in Section (5). We also use JEEK with no additional knowledge (JEEK-NK) as a baseline.

JEEK is available as the R package ’jeek’ in CRAN.

Figure 2: Performance comparison on simulated datasets using co-hub knowledge: AUC vs. time when varying the number of nodes p.

6.1 Experiment: Simulated Samples with Known Hubs as Knowledge

Inspired by the JGL-co-hub and JGL-perturb-hub toolbox (JGL-node) provided by (Mohan et al., 2013), we empirically show JEEK's ability to model known co-hub or perturbed-hub nodes as knowledge when estimating multiple sGGMs. We generate multiple simulated Gaussian datasets through a random graph model (Rothman et al., 2008) to simulate both co-hub and perturbed-hub graph structures (details in 14.1). We use the JGL-node package, W-SIMULE and JEEK-NK as baselines for this set of experiments. The weights in W are designed using the strategy proposed in Section (12).

We use the AUC score (to reflect the consistency and variance of a method's performance when varying its important hyper-parameter) and the computational time cost to compare JEEK with the baselines. We compare all methods on many simulated cases by varying the number of nodes p and the number of tasks K. In Figure 2 and Figure 3(a)(b), JEEK consistently achieves higher AUC scores than the baselines JGL, JEEK-NK and W-SIMULE in all cases, and JEEK is many times faster than the baselines on average. In Figure 2, for each of the larger-p cases, W-SIMULE takes more than one month and JGL takes more than one day; therefore we cannot show them for those settings.

6.2 Experiment: Gene Interaction Network from Real-World Genomics Data

Next, we apply JEEK and the baselines to one real-world biomedical dataset of gene expression profiles across two different cell types. We explored two different types of knowledge: (1) known edges and (2) known groups of genes. Figure 3(c) shows that JEEK has a lower time cost and recovers more interactions than the baselines (a higher number of edges matched to existing bio-databases). More results are in Appendix Section (14.2), and the design of the weight matrices for this case is in Section (12).

6.3 Experiment: Simulated Data about Brain Connectivity with Distance as Knowledge

Following (Bu & Lederer, 2017), we use known Euclidean distances between human brain regions as additional knowledge and use them to generate multiple simulated datasets (details in Section 14.3). We compare JEEK with the baselines regarding (a) scalability (computational time cost) and (b) effectiveness (F1-score, because the NAK package does not allow AUC calculation). For each simulation case, the computation time for each estimator is the summation of that method's execution time over all values of the regularization hyper-parameter. Figure 4(a)(b) show clearly that JEEK outperforms its baselines: JEEK has a consistently higher F1-score and is many times faster than W-SIMULE in the high-dimensional case. JEEK performs better than JEEK-NK, confirming the advantage of integrating additional distance knowledge. While NAK is fast, its F1-score is close to zero, and it is hence not useful for multi-sGGM structure learning.

6.4 Experiment: Functional Connectivity Estimation from Real-World Brain fMRI Data

We evaluate JEEK and the relevant baselines on a classification task on one real-world publicly available resting-state fMRI dataset: ABIDE (Di Martino et al., 2014). The ABIDE data aim to understand human brain connectivity and how it reflects neural disorders (Van Essen et al., 2013). ABIDE includes two groups of human subjects, autism and control, and therefore we formulate it as a two-context joint graph estimation problem. We utilize the spatial distance between human brain regions as additional knowledge for estimating functional connectivity edges among brain regions. We use Linear Discriminant Analysis (LDA) for a downstream classification task aiming to assess the ability of a graph estimator to learn the differential patterns of the connectome structures. (Details of the ABIDE dataset, baselines, the design of the additional knowledge matrix, cross-validation and the LDA classification method are in Section (14.4).)

Figure 4(c) compares JEEK and three baselines: JEEK-NK, W-SIMULE, and W-SIMULE with no additional knowledge (W-SIMULE-NK). JEEK yields a classification accuracy of 58.62% for distinguishing the autism subjects from the control subjects, clearly outperforming JEEK-NK and W-SIMULE-NK. JEEK is also much faster than the W-SIMULE estimators, placing it in the top-left region of Figure 4(c) (higher classification accuracy and lower time cost). We also experimented with variations of the knowledge matrix and found the classification results to be fairly robust to these variations (Section (14.4)).

7 Conclusions

We propose a novel method, JEEK, to incorporate additional knowledge in estimating multi-sGGMs. JEEK achieves the same asymptotic convergence rate as the state-of-the-art. Our experiments have showcased using weights for describing pairwise knowledge among brain regions, for shared-hub knowledge, for perturbed-hub knowledge, for describing group information among nodes (e.g., genes known to be in the same pathway), and for using known interaction edges as the knowledge.

Figure 3: (a)(b) Performance comparison on simulated datasets about hubs: AUC vs. time when varying the number of tasks K. (a) shows the perturbed-hub cases and (b) the co-hub cases. (c) Performance comparison on one real-world gene expression dataset with two cell types. Two types of knowledge are used to cover one fifth of the nodes; therefore each method corresponds to two performance points.
Figure 4: Experimental results on simulated brain datasets and on ABIDE. (a) Performance obtained on simulated brain samples with respect to F1-score vs. computational time cost when varying the number of tasks K. (b) Performance obtained on simulated brain samples with respect to F1-score vs. computational time cost when varying the number of samples. In both (a) and (b) the smaller box shows an enlarged view comparing the JEEK and JEEK-NK points; all JEEK points are in the top-left region, indicating higher F1-score and lower computational cost. (c) On ABIDE, JEEK outperforms the baseline methods in both classification accuracy and running-time cost: the JEEK and JEEK-NK points are in the top-left region, and the JEEK points are higher in terms of y-axis position.

Acknowledgement: This work was supported by the National Science Foundation under NSF CAREER award No. 1453580. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.

Appendix:

8 More about Method

Notations: X^(i) is the data matrix for the i-th task, which includes n_i data samples, each described by p feature variables. Then n_tot = Σ_i n_i is the total number of data samples. We use the notation Ω for precision matrices and Σ̂ for estimated covariance matrices. Given a p-dimensional vector v, we denote its ℓ1-norm as ‖v‖_1 = Σ_j |v_j| and its ℓ∞-norm as ‖v‖_∞ = max_j |v_j|. Similarly, for a matrix A, let ‖A‖_1 = Σ_{j,k} |A_{j,k}| be the element-wise ℓ1-norm of A and ‖A‖_∞ = max_{j,k} |A_{j,k}| be the element-wise ℓ∞-norm of A.

8.1 More about Solving JEEK

In Eq. (13), let a_i = a_i⁺ − a_i⁻ and a_S = a_S⁺ − a_S⁻, with all split variables non-negative. If a_i ≥ 0, then a_i⁺ = a_i and a_i⁻ = 0. If a_i < 0, then a_i⁺ = 0 and a_i⁻ = −a_i. The variables a_S⁺ and a_S⁻ have similar definitions. Then Eq. (13) can be solved by the following small linear programming problem.
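In the split variables the program reads as follows (our reconstruction; how λ_n and the weights enter the tolerances τ_i follows the dual-norm constraints of Eq. (12)):

min_{a_i⁺, a_i⁻, a_S⁺, a_S⁻ ≥ 0}   Σ_{i=1}^{K} w_i (a_i⁺ + a_i⁻) + w_S (a_S⁺ + a_S⁻)
subject to   −τ_i ≤ (a_i⁺ − a_i⁻) + (a_S⁺ − a_S⁻) − r_i ≤ τ_i,   i = 1, …, K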

8.2 JEEK is Group-Entry-Wise Optimizable and Parallelizable

JEEK can easily be parallelized. Essentially we just need to revise the "for" loops of steps 6 and 7 in Algorithm 1 into, for instance, "one entry group per machine" or "one entry group per core". We now prove that JEEK is group-entry-wise optimizable and parallelizable, i.e., that our estimator can be optimized asynchronously in a group-entry-wise manner. (JEEK is group-entry-wise optimizable) Suppose we use JEEK to infer multiple inverse covariance matrices summarized as Ω_tot. Each group of entries is the set of parameters at a given position (j,k). Varying j and k, we have a total of p^2 groups. If these groups are independently estimated by JEEK, then we have,

(19)
Proof.

Eq. (13) gives exactly these small linear-programming subproblems, one for each group of entries. ∎
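Since the p^2 groups are fully independent, they can be dispatched to separate workers; a minimal Python sketch using multiprocessing (the function names are ours, and "solver" stands for any per-group routine such as the LP sketch given after Algorithm 1; this is not the interface of the released jeek package):

import numpy as np
from itertools import product
from functools import partial
from multiprocessing import Pool

def _solve_one(args, solver):
    r, w_I, w_S, tol, j, k = args
    a, a_S = solver(r, w_I, w_S, tol)        # per-group solve (e.g., a small LP)
    return a, a_S, j, k

def solve_all_groups(R, W_I, W_S, tol, solver, processes=4):
    """Dispatch all p*p independent entry groups to a pool of workers.
    R, W_I, tol are K x p x p arrays; W_S is p x p."""
    K, p, _ = R.shape
    jobs = [(R[:, j, k], W_I[:, j, k], W_S[j, k], tol[:, j, k], j, k)
            for j, k in product(range(p), range(p))]
    with Pool(processes) as pool:
        results = pool.map(partial(_solve_one, solver=solver), jobs)
    Omega_I, Omega_S = np.zeros((K, p, p)), np.zeros((p, p))
    for a, a_S, j, k in results:
        Omega_I[:, j, k] = a
        Omega_S[j, k] = a_S
    return Omega_I, Omega_S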

8.3 Extending JEEK with Structured Norms

We can add more flexibility to JEEK by adding structured norms like the second regularization functions used in JGL. This extends JEEK to the following formulation:

(20)

Here the optimization needs to handle the additional structured norm together with the kw-norm. We propose two ways to solve Eq. (20). (1) The first is to use a parallelized proximal algorithm directly. However, this requires the kw-norm to have a closed-form proximal operator, which has not been discovered. (2) In the second strategy, we treat each weighted norm (either individual or shared) in the objective of Eq. (20) as an independent regularizer. However, this increases the number of proximal operators we need to calculate per iteration. Both solutions make the extended JEEK algorithm less fast or less scalable. Therefore, we choose not to introduce this extension in this paper.

9 Connecting to the Bayesian statistics

Our approach has a close connection to a hierarchical Bayesian model. We show that the additional knowledge weight matrices are also the parameters of the prior distribution of Ω_tot. In our formulation (Eq. (12)), W_I^(i) and W_S are the additional knowledge weight matrices. From a hierarchical Bayesian view, the first level of the prior is a Gaussian distribution and the second level is a Laplace distribution. In the following, we show that the weight matrices are also the parameters of the Laplace distributions, which form the prior distribution of Ω_tot.

By the definition, there are only two possible situations:

Case I ():

(21)
(22)
(23)

Here the entry follows a Laplace distribution with mean 0, and its scale (diversity) parameter is set by the corresponding weight: the larger the weight, the more the distribution concentrates around 0, i.e., the higher the density at 0.

Case II ():

(24)
(25)
(26)

Here, again, the entry follows a Laplace distribution with mean 0, and its scale (diversity) parameter is set by the corresponding weight: the larger the weight, the more the distribution concentrates around 0, i.e., the higher the density at 0.

Therefore, we can combine the above two cases into the following single equation.

(27)

Our final hierarchical Bayesian formulation consists of Eq. (21) and Eq. (27). This model is a generalization of the model considered in the seminal paper on the Bayesian lasso (Park & Casella, 2008). The weight parameters in our general model are hyper-parameters that specify the shape of the prior distribution of each edge in Ω_tot. The negative log-posterior distribution of Ω_tot is now given by:

(28)

Eq. (28) is a weighted variation of Eq. (1).
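For completeness, a sketch (in our notation, approximating the symbols lost above) of how an entry-wise Laplace prior whose scale is set by the knowledge weight leads to a weighted ℓ1 term in the negative log-posterior:

p(Ω_{jk} | w_{jk}) = (λ w_{jk} / 2) · exp( −λ w_{jk} |Ω_{jk}| )

−log p(Ω | X) ∝ −log det(Ω) + ⟨Ω, Σ̂⟩ + λ Σ_{j,k} w_{jk} |Ω_{jk}|,

i.e., a weighted variant of the GLasso objective in Eq. (1), which is what Eq. (28) expresses.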

10 More about Theoretical Analysis

10.1 Theorems and Proofs of three properties of kw-norm

In this sub-section, we prove the three properties of kw-norm used in Section 3.2. We then provide the convergence rate of our estimator based on these three properties.

  • (i) kw-norm is a norm function if and only if no entry of W_I^(i) (for any i) or W_S equals 0.

  • (ii) If the condition in (i) holds, kw-norm is a decomposable norm.

  • (iii) If the condition in (i) holds, the dual norm of the kw-norm is given by Eq. (30).

10.1.1 Norm:

First we prove the correctness of the argument that the kw-norm is a norm function via the following theorem: Eq. (11) is a norm if and only if no entry of W_I^(i) (for any i) or W_S equals 0. This theorem gives the sufficient and necessary conditions for the kw-norm (Eq. (11)) to be a norm function.

10.1.2 Decomposable Norm:

Then we show that the kw-norm is a decomposable norm within a certain subspace pair. Before providing the theorem, we restate the structural assumption on the parameter.

(IS-Sparsity): The 'true' parameter for the K multiple GGM structures can be decomposed into two clear structures, Ω*_I and Ω*_S. Ω*_I is exactly sparse, with its non-zero entries indexed by a support set S_I, and Ω*_S is exactly sparse, with its non-zero entries indexed by a support set S_S. All other elements equal 0 (i.e., lie in (S_I ∪ S_S)^c). (IS-subspace)

(29)

Eq. (11) is a decomposable norm with respect to the subspace pair defined in Eq. (29).

10.1.3 Dual Norm of kw-Norm:

To obtain the final formulation Eq. (12) and its statistical convergence rate, we need to derive the dual-norm formulation of the kw-norm. The dual norm of the kw-norm (Eq. (11)) is

R*(u_tot) = max_i max( ‖(1/W_I^(i)) ∘ u^(i)‖_∞, ‖(1/W_S) ∘ u^(i)‖_∞ )      (30)

The details of the proof are as follows.

10.1.4 Proof of Theorem (10.1.1)

For kw-norm, and equals to and .

Proof.

If , then . Notice that . ∎

Proof.

To prove that the kw-norm is a norm, by Lemma (11.3) the only thing we need to prove is that the weighted ℓ1 function ‖W ∘ ·‖_1 is a norm function when no entry of W equals 0: 1. it is non-negative; 2. it equals 0 only at the zero matrix; 3. it is absolutely homogeneous; 4. it satisfies the triangle inequality. Based on the above, ‖W ∘ ·‖_1 is a norm function. Since a summation of norms is still a norm, the kw-norm is a norm function. ∎

Furthermore, we have the following Lemma: the dual norm of ‖W ∘ ·‖_1 is ‖(1/W) ∘ ·‖_∞.

Proof.
(31)
(32)
(33)

10.1.5 Proof of Theorem (10.1.2)

Proof.

Assume the two components lie in the subspace pair defined in Eq. (29); then the kw-norm of their sum equals the sum of their kw-norms. Therefore, the kw-norm is a decomposable norm with respect to the subspace pair of Eq. (29). ∎

10.1.6 Proof of Theorem (10.1.3)

Proof.

Suppose , where . Then the dual norm can be derived by the following equation.