1 Introduction
Technology revolutions in the past decade have collected large-scale heterogeneous samples from many scientific domains. For instance, genomic technologies have delivered petabytes of molecular measurements across hundreds of types of cells and tissues from national projects like ENCODE (Consortium et al., 2012) and TCGA (Network et al., 2011). Neuroimaging technologies have generated petabytes of functional magnetic resonance imaging (fMRI) datasets across thousands of human subjects, shared publicly through projects like openfMRI (Poldrack et al., 2013). Given such data, understanding and quantifying variable graphs from heterogeneous samples across multiple contexts is a fundamental analysis task.
Such variable graphs can significantly simplify network-driven studies of diseases (Ideker & Krogan, 2012), can help to understand the neural characteristics underlying clinical disorders (Uddin et al., 2013), and can allow for understanding genetic or neural pathways and systems. The number of contexts (denoted as K) that these applications need to consider grows extremely fast, ranging from tens (e.g., cancer types in TCGA (Network et al., 2011)) to thousands (e.g., the number of subjects in openfMRI (Poldrack et al., 2013)). The number of variables (denoted as p) ranges from hundreds (e.g., the number of brain regions) to tens of thousands (e.g., the number of human genes).
The above data analysis problem can be formulated as jointly estimating K conditional dependency graphs on a single set of p variables, based on heterogeneous samples accumulated from K distinct contexts. For homogeneous data samples from a given i-th context, one typical approach is the sparse Gaussian Graphical Model (sGGM) (Lauritzen, 1996; Yuan & Lin, 2007). sGGM assumes the samples are independently and identically drawn from N(μ, Σ), a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ. The graph structure is encoded by the sparsity pattern of the inverse covariance matrix, also named the precision matrix, Ω = Σ^{-1}: Ω_{j,k} = 0 if and only if no edge connects the j-th node and the k-th node (i.e., they are conditionally independent). sGGM imposes an ℓ1 penalty on Ω to achieve a consistent estimation in high-dimensional situations. When handling heterogeneous data samples, rather than estimating the sGGM of each condition separately, a multi-task formulation that jointly estimates K different but related sGGMs can lead to better generalization (Caruana, 1997).
Previous studies on the joint estimation of multiple sGGMs roughly fall into four categories (Danaher et al., 2013; Mohan et al., 2013; Chiquet et al., 2011; Honorio & Samaras, 2010; Guo et al., 2011; Zhang & Wang, 2012; Zhang & Schneider, 2010; Zhu et al., 2014): (1) The first group seeks to optimize a sparsity-regularized data likelihood function plus an extra penalty function to enforce structural similarity among the multiple estimated networks. The joint graphical lasso (JGL) (Danaher et al., 2013) proposed an alternating direction method of multipliers (ADMM) based optimization algorithm to work with two regularization functions. (2) The second category tries to recover the support of each precision matrix using sparsity-penalized regressions in a column-by-column fashion. Recently, (Monti et al., 2015) proposed to learn population- and subject-specific brain connectivity networks via a so-called "Mixed Neighborhood Selection" (MNS) method in this category. (3) The third type of method seeks to minimize the joint sparsity of the target precision matrices under matrix inversion constraints. One recent study, named SIMULE (Shared and Individual parts of MULtiple graphs Explicitly) (Wang et al., 2017b), automatically infers both the specific edge patterns that are unique to each context and the shared interactions preserved among all the contexts (i.e., by modeling each precision matrix as Ω^{(i)} = Ω_I^{(i)} + Ω_S) via constrained ℓ1 minimization.
Following the CLIME estimator (Pang et al., 2014), the constrained convex formulation can also be solved column by column via linear programming. However, all three categories of the aforementioned estimators are difficult to scale up when the dimension p or the number of tasks K is large, because they cannot avoid expensive steps like SVD for JGL (Danaher et al., 2013), linear programming for SIMULE, or running multiple iterations of expensive penalized regressions in MNS. (4) The last category extends the so-called "Elementary Estimator" graphical model (EE-GM) formulation (Yang et al., 2014b) to revise JGL's penalized likelihood into a constrained convex program. One proposed estimator, FASJEM (Wang et al., 2017a), is solved in an entry-wise and group-entry-wise manner and largely outperforms the speed of its JGL counterparts. More details of the related works are in Section 5.
One significant caveat of state-of-the-art joint sGGM estimators is the fact that little attention has been paid to incorporating existing knowledge of the nodes or knowledge of the relationships among nodes in the models. In addition to the samples themselves, such additional information is widely available in real-world applications, and incorporating it is of great scientific interest. A prime example is when estimating the functional brain connectivity networks among brain regions based on fMRI samples, where the spatial position of the regions is readily available. Neuroscientists have gathered considerable knowledge regarding the spatial and anatomical evidence underlying brain connectivity (e.g., short edges and certain anatomical regions are more likely to be connected (Watts & Strogatz, 1998)). Another important example is the problem of identifying gene-gene interactions from patients' gene expression profiles across multiple cancer types. Learning the statistical dependencies among genes from such heterogeneous datasets can help to understand how such dependencies vary from normal to abnormal states and help to discover contributing markers that influence or cause the diseases.
Besides the patient samples, state-of-the-art bio-databases like HPRD (Prasad et al., 2009) have collected a significant amount of information about direct physical interactions among the corresponding proteins, regulatory gene pairs, or signaling relationships, collected from high-quality bio-experiments.
Although it provides strong evidence of the structural patterns we aim to discover, this type of information has rarely been considered in the joint sGGM formulation of such samples. To the authors' best knowledge, only one study, named W-SIMULE, tried to extend the constrained ℓ1 minimization in SIMULE into a weighted formulation, in order to consider spatial information of brain regions in the joint discovery of heterogeneous neural connectivity graphs (Singh et al., 2017). This method was designed just for neuroimaging samples and has a high computational time cost, making it not scalable to large-scale settings (more details in Section 3).
This paper aims to fill this gap by adding additional knowledge most effectively into scalable and fast joint sGGM estimation. We propose a novel model, namely the Joint Elementary Estimator incorporating additional Knowledge (JEEK), that presents a principled and scalable strategy to include additional knowledge when estimating multiple related sGGMs jointly. Briefly speaking, this paper makes the following contributions:


Novel approach: JEEK presents a new way of integrating additional knowledge in learning multi-task sGGMs in a scalable way. (Section 3)

Fast optimization: We optimize JEEK in an entry-wise and group-entry-wise manner that dramatically improves the time complexity. (Section 3.4)

Convergence rate: We theoretically prove the convergence rate of JEEK. This rate shows the benefit of joint estimation and matches the convergence rate of state-of-the-art estimators that are much harder to compute. (Section 4)

Evaluation: We evaluate JEEK on several synthetic datasets and two real-world datasets, one from neuroscience and one from genomics. It significantly outperforms state-of-the-art baselines in terms of speed. (Section 6)
JEEK provides the flexibility of using (K+1) different weight matrices representing the extra knowledge. We showcase a few possible designs of the weight matrices in Section 12, including (but not limited to):

Spatial or anatomy knowledge about brain regions;

Knowledge of known co-hub nodes or perturbed nodes;

Known group information about nodes, such as genes belonging to the same biological pathway or cellular location;

Using existing known edges as the knowledge, like the known protein interaction databases, for discovering gene networks (a semi-supervised setting for such estimations).
We sincerely believe the scalability and flexibility provided by JEEK can make structure learning of joint sGGM feasible in many realworld tasks.
Note: Due to space limitations, we have put the details of certain contents (e.g., proofs) in the appendix. Notations with "S:" as the prefix in the numbering mean the corresponding contents are in the appendix. For example, full proofs are in Section 10.
Notations: the math notations we use are described in Section 8. n_tot denotes the total number of data samples. The Hadamard product ∘ is the element-wise product between two matrices. Also, to simplify notation, we abuse the notation A/B to represent the new matrix generated by element-wise division of each entry of A by the corresponding entry of B.
2 Background
Sparse Gaussian graphical model (sGGM):
The classic formulation for estimating a sparse Gaussian Graphical model (Yuan & Lin, 2007) from a single given condition (single sGGM) is the "graphical lasso" estimator (GLasso) (Yuan & Lin, 2007; Banerjee et al., 2008). It solves the following penalized maximum likelihood estimation (MLE) problem:
(1)   argmin_{Ω ≻ 0}  −log det(Ω) + tr(Σ̂ Ω) + λ_n ‖Ω‖_1
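As a concrete reference point, the GLasso objective of Eq. (1) is easy to evaluate numerically. The sketch below (in Python with NumPy; the function name is ours) computes the penalized negative log-likelihood with an off-diagonal ℓ1 penalty, as is common for precision matrices:

```python
import numpy as np

def glasso_objective(omega, S, lam):
    """Penalized negative log-likelihood of Eq. (1):
    -log det(omega) + tr(S @ omega) + lam * ||omega||_{1,off-diag}."""
    sign, logdet = np.linalg.slogdet(omega)
    assert sign > 0, "omega must be positive definite"
    # l1 norm over off-diagonal entries only
    off_l1 = np.abs(omega).sum() - np.abs(np.diag(omega)).sum()
    return -logdet + np.trace(S @ omega) + lam * off_l1
```

For example, with Ω = Σ̂ = I (p = 3), the log-determinant and the penalty vanish and the objective equals the trace, 3.0. Solvers such as GLasso minimize this objective over positive definite Ω.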
MEstimator with Decomposable Regularizer in HighDimensional Situations:
Recently, the seminal study (Negahban et al., 2009) proposed a unified framework for high-dimensional analysis of the following general formulation of M-estimators with decomposable regularizers:
(2)   argmin_θ  L(θ) + λ_n R(θ)
where R(·) represents a decomposable regularization function, L(·) represents a loss function (e.g., the negative log-likelihood function in sGGM), and λ_n is the tuning parameter.
Elementary Estimators (EE):
Using the analysis framework from (Negahban et al., 2009), recent studies (Yang et al., 2014a, b, c) propose a new category of estimators, named "Elementary Estimators" (EE), with the following general formulation:
(3)   argmin_θ  R(θ)   subject to   R*(θ − θ̂_n) ≤ λ_n
where R*(·) is the dual norm of R(·):
(4)   R*(v) := sup_{u ≠ 0}  ⟨u, v⟩ / R(u)
The solution of Eq. (3) achieves the same near-optimal convergence rate as Eq. (2) when θ̂_n satisfies certain conditions. R(·) represents a decomposable regularization function (e.g., the ℓ1 norm) and R*(·) is the dual norm of R(·) (e.g., the ℓ∞ norm is the dual norm of the ℓ1 norm). λ_n is a regularization parameter.
The basic motivation of Eq. (3) is to build simpler and possibly faster estimators that nevertheless come with statistical guarantees comparable to those of the regularized MLE. θ̂_n needs to be carefully constructed, well-defined, and closed-form for the purpose of simpler computation. The formulation defined by Eq. (3) ensures that its solution has the desired structure defined by R(·). For the high-dimensional estimation of linear regression models, θ̂_n can be the classical ridge estimator, which is itself closed-form and comes with strong statistical convergence guarantees in high-dimensional situations.
EE-sGGM:
(Yang et al., 2014b) proposed elementary estimators for graphical models (GM) of exponential families, in which θ̂_n represents a so-called proxy of the backward mapping for the target GM (more details in Section 11). The key idea (summarized in the upper row of Figure 1) is to investigate where the vanilla MLE "breaks down" when estimating a graphical model of exponential families in high dimensions (Yang et al., 2014b). Essentially, the vanilla graphical model MLE can be expressed as a backward mapping that computes the model parameters from some given moments of an exponential family distribution. For instance, in the case of learning a Gaussian GM (GGM) with vanilla MLE, the backward mapping is Σ̂^{-1}, which estimates Ω from the sample covariance matrix (moment) Σ̂. We introduce the details of backward mapping in Section 11.
However, even though this backward mapping has a simple closed form for GGM, it is normally not well-defined in high-dimensional settings. Given the sample covariance Σ̂, we cannot just compute the vanilla MLE solution as Σ̂^{-1} for GGM, since Σ̂ is rank-deficient when p > n. Therefore Yang et al. (Yang et al., 2014b) used a carefully constructed proxy backward mapping [T_v(Σ̂)]^{-1} that is both available in closed form and well-defined in high-dimensional settings for GGMs. We introduce the details of [T_v(Σ̂)]^{-1} and its statistical properties in Section 11. Now Eq. (3) becomes the following closed-form estimator for learning sparse Gaussian graphical models (Yang et al., 2014b):
(5)   argmin_Ω  ‖Ω‖_{1,off}   subject to   ‖Ω − [T_v(Σ̂)]^{-1}‖_{∞,off} ≤ λ_n
Eq. (5) is a special case of Eq. (3), in which R(·) is the off-diagonal ℓ1 norm and the precision matrix Ω is the parameter we search for. When R(·) is the ℓ1 norm, solving Eq. (3) (and Eq. (5)) just requires entry-wise thresholding operations on [T_v(Σ̂)]^{-1} to ensure the desired sparsity structure of the final solution.
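The two-step closed form described above can be sketched in a few lines. This is an illustrative implementation only, under the assumptions that T_v soft-thresholds the off-diagonal entries of the sample covariance and that the outer ℓ1 solution reduces to entry-wise soft-thresholding (the function names are ours):

```python
import numpy as np

def soft_threshold_offdiag(A, t):
    """Element-wise soft-thresholding applied to off-diagonal entries only."""
    S = np.sign(A) * np.maximum(np.abs(A) - t, 0.0)
    np.fill_diagonal(S, np.diag(A))  # keep diagonal untouched
    return S

def ee_sggm(sigma_hat, v, lam):
    """Closed-form EE-sGGM sketch: threshold the covariance (proxy backward
    mapping [T_v(Sigma)]^{-1}), invert, then threshold the result to enforce
    off-diagonal sparsity in the estimated precision matrix."""
    Tv = soft_threshold_offdiag(sigma_hat, v)
    B = np.linalg.inv(Tv)
    return soft_threshold_offdiag(B, lam)
```

Note that no iterative optimization is involved; both steps are element-wise operations plus one matrix inversion, which is the source of EE-sGGM's speed.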
3 Proposed Method: JEEK
In applications of Gaussian graphical models, we typically have more information than just the data samples themselves. This paper aims to propose a simple, scalable and theoreticallyguaranteed joint estimator for estimating multiple sGGMs with additional knowledge in largescale situations.
3.1 A Joint EE (JEE) Formulation
We first propose to jointly estimate the K related sGGMs from K data blocks using the following formulation:
(6)   argmin_{Ω^{(i)} ≻ 0}  Σ_{i=1}^{K} L(Ω^{(i)}) + λ_n R(Ω^{(1)}, ..., Ω^{(K)})
where Ω^{(i)} denotes the precision matrix for the i-th task, L(·) describes the negative log-likelihood function in sGGM, Ω^{(i)} ≻ 0 means that Ω^{(i)} needs to be a positive definite matrix, and R(·) represents a decomposable regularization function enforcing sparsity and structure assumptions (details in Section 3.2).
For ease of notation, we denote Ω_tot = (Ω^{(1)}, ..., Ω^{(K)}) and Σ̂_tot = (Σ̂^{(1)}, ..., Σ̂^{(K)}); both are collections of K matrices of size p × p (the former being the parameters to estimate). Now define an inverse function inv(A_tot) = ((A^{(1)})^{-1}, ..., (A^{(K)})^{-1}), where A_tot is a given collection with the same structure as Ω_tot. Then we rewrite Eq. (6) into the following form:
(7)   argmin_{Ω_tot ≻ 0}  L(Ω_tot) + λ_n R(Ω_tot)
Now, connecting Eq. (7) to Eq. (2) and Eq. (3), we propose the following joint elementary estimator (JEE) for learning multiple sGGMs:
(8)   argmin_{Ω_tot}  R(Ω_tot)   subject to   R*(Ω_tot − inv(Σ̂_tot)) ≤ λ_n
The fundamental component in Eq. (3) for the single-context sGGM was to use a well-defined proxy function [T_v(Σ̂)]^{-1} to approximate the vanilla MLE solution (named the backward mapping for exponential family distributions) (Yang et al., 2014b). The proposed proxy is both well-defined in high-dimensional situations and available in a simple closed form. Following a similar idea, when learning multiple sGGMs, we propose to use inv(T_v(Σ̂_tot)) := ([T_v(Σ̂^{(1)})]^{-1}, ..., [T_v(Σ̂^{(K)})]^{-1}) and get the following joint elementary estimator:
(9)   argmin_{Ω_tot}  R(Ω_tot)   subject to   R*(Ω_tot − inv(T_v(Σ̂_tot))) ≤ λ_n
3.2 Knowledge as Weight (KWNorm)
The main goal of this paper is to design a principled strategy to incorporate existing knowledge (other than samples or structural assumptions) into the multi-sGGM formulation. We consider two factors in such a design:
(1) When learning multiple sGGMs jointly in real-world applications, it is often of great scientific interest to model and learn context-specific graph variations explicitly, because such variations can "fingerprint" important markers in domains like cognition (Ideker & Krogan, 2012) or pathology (Kelly et al., 2012). Therefore we design Ω^{(i)} to share parameters between different contexts. Mathematically, we model Ω^{(i)} as two parts:
(10)   Ω^{(i)} = Ω_I^{(i)} + Ω_S
where Ω_I^{(i)} is the individual precision matrix for context i and Ω_S is the shared precision matrix among contexts. Again, for ease of notation, we denote Ω_I = (Ω_I^{(1)}, ..., Ω_I^{(K)}) and Ω_tot = (Ω_I, Ω_S).
(2) We represent additional knowledge as positive weight matrices. More specifically, we represent the knowledge of the i-th task-specific graph as the weight matrix W_I^{(i)}, with W_S representing existing knowledge of the shared network. This positive-matrix-based representation is a powerful and flexible strategy that can describe many possible forms of existing knowledge. In Section 12, we provide a few different designs of W_I^{(i)} and W_S for real-world applications. In total, we have (K+1) weight matrices to represent additional knowledge. To simplify notation, we denote W_I = (W_I^{(1)}, ..., W_I^{(K)}) and W_tot = (W_I, W_S).
Now we propose the following knowledge-as-weight norm (kw-norm) combining the above two:
(11)   R_kw(Ω_tot) = Σ_{i=1}^{K} ‖W_I^{(i)} ∘ Ω_I^{(i)}‖_1 + ‖W_S ∘ Ω_S‖_1
Here the Hadamard product ∘ is the element-wise product between two matrices, i.e., (A ∘ B)_{j,k} = A_{j,k} B_{j,k}.
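As a quick illustration of Eq. (11), the kw-norm is just a sum of weighted entry-wise ℓ1 norms, each taken over a Hadamard product with its weight matrix (a minimal sketch; the function and argument names are ours):

```python
import numpy as np

def kw_norm(omega_ind, omega_shared, w_ind, w_shared):
    """kw-norm of Eq. (11): sum over tasks of ||W_I^(i) o Omega_I^(i)||_1
    plus ||W_S o Omega_S||_1, where 'o' is the Hadamard product."""
    total = sum(np.abs(W * O).sum() for W, O in zip(w_ind, omega_ind))
    return total + np.abs(w_shared * omega_shared).sum()
```

Larger weights thus penalize the corresponding entries more heavily, which is how prior knowledge (e.g., unlikely edges) enters the estimator.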
The kw-norm (Eq. (11)) has the following three properties:

(i) The kw-norm is a norm function if and only if no entries of W_I and W_S equal 0.

(ii) If the condition in (i) holds, the kw-norm is a decomposable norm.

(iii) If the condition in (i) holds, the dual norm of the kw-norm is R_kw*(u_tot) = max( max_i ‖u_I^{(i)} / W_I^{(i)}‖_∞, ‖u_S / W_S‖_∞ ), where / denotes element-wise division. (In a previous version we mistakenly wrote a different form of this dual norm; we correct the relevant equations here. Our code implementation was correct and thus did not change.)
Section 10.1 provides proofs of the above claims.
3.3 JEE with Knowledge (JEEK)
Plugging Eq. (11) into Eq. (9), we obtain the following formulation of JEEK for learning multiple related sGGMs from heterogeneous samples:
(12)   argmin_{Ω_I, Ω_S}  Σ_{i=1}^{K} ‖W_I^{(i)} ∘ Ω_I^{(i)}‖_1 + ‖W_S ∘ Ω_S‖_1
subject to:  ‖(Ω_I^{(i)} + Ω_S − [T_v(Σ̂^{(i)})]^{-1}) / W_I^{(i)}‖_∞ ≤ λ_n,  ‖(Ω_I^{(i)} + Ω_S − [T_v(Σ̂^{(i)})]^{-1}) / W_S‖_∞ ≤ λ_n,  i = 1, ..., K
In Section 4, we theoretically prove that the statistical convergence rate of JEEK achieves the same sharp convergence rate as state-of-the-art estimators for multi-task sGGMs. Our proofs are inspired by the unified framework of high-dimensional statistics (Negahban et al., 2009).
3.4 Solution of JEEK:
A huge computational advantage of JEEK (Eq. (12)) is that it can be decomposed into many independent small linear programming problems, one per entry position. To simplify notation, for a fixed entry position (j,k) we denote the (j,k)-th entry of [T_v(Σ̂^{(i)})]^{-1} as a^{(i)}, the (j,k)-th entries of Ω_I^{(i)} and Ω_S as ω_i and ω_s, and the corresponding entries of W_I^{(i)} and W_S as w_i and w_s. "A group of entries" means the set of parameters {ω_1, ..., ω_K, ω_s} for a certain (j,k).
In order to estimate Ω_tot, JEEK (Eq. (12)) can be decomposed into the following formulation for a certain (j,k):
(13)   argmin_{ω_1, ..., ω_K, ω_s}  Σ_{i=1}^{K} w_i |ω_i| + w_s |ω_s|
subject to:  |ω_i + ω_s − a^{(i)}| ≤ λ_n w_i  and  |ω_i + ω_s − a^{(i)}| ≤ λ_n w_s,  i = 1, ..., K
Eq. (13) can easily be converted into the linear programming form of Section 8.1 with only (K+1) variables, so each subproblem is cheap to solve. Considering that JEEK has one such subproblem per entry position, a total of O(p^2) subproblems, its overall computation decomposes completely and can be parallelized. We summarize the optimization algorithm of JEEK in Algorithm 1 (details in Section 8.2).
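To make the per-entry decomposition concrete, the sketch below solves one subproblem as a linear program via the standard positive/negative variable splitting. The exact constraint form is our simplifying assumption (a uniform bound |ω_i + ω_s − a^{(i)}| ≤ λ per task, without the weight scaling of Eq. (13)); function and variable names are ours:

```python
import numpy as np
from scipy.optimize import linprog

def jeek_entry_subproblem(a, w_ind, w_shared, lam):
    """Sketch of one per-entry subproblem:
        min  sum_i w_ind[i]*|x_i| + w_shared*|x_s|
        s.t. |x_i + x_s - a[i]| <= lam  for each task i.
    Splitting x = x_plus - x_minus with x_plus, x_minus >= 0 yields an LP."""
    K = len(a)
    n = K + 1                                  # K individual entries + 1 shared
    w = np.append(np.asarray(w_ind, float), w_shared)
    M = np.zeros((K, n))
    M[np.arange(K), np.arange(K)] = 1.0        # picks x_i
    M[:, K] = 1.0                              # adds x_s
    c = np.concatenate([w, w])                 # cost on both split parts
    A_ub = np.vstack([np.hstack([M, -M]), np.hstack([-M, M])])
    b_ub = np.concatenate([np.asarray(a) + lam, -np.asarray(a) + lam])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    z = res.x
    return z[:n] - z[n:], res.fun
```

Because each (j,k) position yields one such tiny LP with only K+1 variables, the full estimator is embarrassingly parallel across entry positions.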
4 Theoretical Analysis
KWNorm:
Theoretical error bounds of Proxy Backward Mapping:
(Yang et al., 2014b) proved that, under suitable sample-size conditions, the proxy backward mapping [T_v(Σ̂)]^{-1} used by EE-sGGM achieves a sharp convergence rate to its truth. The proof extended a previous study (Rothman et al., 2009) devised for estimating the covariance matrix consistently in high-dimensional situations. See the detailed proofs in Section 11.3. To derive the statistical error bound of JEEK, we need to assume that the [T_v(Σ̂^{(i)})]^{-1} are well-defined. This is ensured by assuming that the true covariance matrices satisfy the conditions defined in Section 10.1.
Theoretical error bounds of JEEK:
We now use the high-dimensional analysis framework from (Negahban et al., 2009), the three properties of the kw-norm, and the error bounds of the backward mapping from (Rothman et al., 2009; Yang et al., 2014b) to derive the statistical convergence rates of JEEK. Detailed proofs of the following theorems are in Section 10.
Before providing the theorem, we need to define the structural assumption, IS-Sparsity, that we assume for the true parameters.
(IS-Sparsity): The 'true' parameter Ω*_tot can be decomposed into two clear structures: Ω_I* = (Ω_I^{(1)*}, ..., Ω_I^{(K)*}) and Ω_S*. Each Ω_I^{(i)*} is exactly sparse with nonzero entries indexed by a support set S_I^{(i)}, and Ω_S* is exactly sparse with nonzero entries indexed by a support set S_S, where S_I^{(i)} ∩ S_S = ∅. All other elements equal 0 (in the complement of the support sets).
Consider Ω_tot whose true parameter satisfies the (IS-Sparsity) assumption. Suppose we compute the solution of Eq. (12) with a λ_n that upper-bounds the dual-norm distance between the truth and the proxy backward mapping inv(T_v(Σ̂_tot)); then the estimated solution satisfies the following error bounds:
(14) 
Proof.
See detailed proof in Section 10.2 ∎
Theorem (4) provides a general bound for any selection of λ_n. The bound is controlled by the distance between inv(T_v(Σ̂_tot)) and the truth. We then extend Theorem (4) to derive the statistical convergence rate of JEEK, which gives us the following corollary: Suppose the high-dimensional setting holds. Then, for an appropriately chosen λ_n and sufficiently large samples, with high probability the estimated optimal solution has the following error bound:
(15)
where the remaining factors are constants.
Bayesian View of JEEK:
In Section (9) we provide a direct Bayesian interpretation of JEEK through the perspective of hierarchical Bayesian modeling. Our hierarchical Bayesian interpretation nicely explains the assumptions we make in JEEK.
5 Connecting to Relevant Studies
JEEK is closely related to a few stateoftheart studies summarized in Table 1. We compare the time complexity and functional properties of JEEK versus these studies.
Method | JEEK | W-SIMULE | JGL | FASJEM | NAK (run K times)
Time Complexity | low (fully parallelizable) | high | high | low | low
Additional Knowledge | YES | YES | NO | NO | YES
NAK: (Bu & Lederer, 2017)
For the single-task sGGM, one recent study (Bu & Lederer, 2017) (following ideas from (Shimamura et al., 2007)) proposed integrating additional knowledge into the estimation of graphical models through a weighted neighbourhood-selection formulation (NAK). NAK is designed for estimating brain connectivity networks from homogeneous samples and incorporates distance knowledge as weight vectors. (Here each neighbourhood regression recovers the sparsity of one column of a single Ω, and the weight vector carries the additional knowledge.) The NAK formulation can be solved by a classic Lasso solver like glmnet. In our experiments, we compare JEEK to NAK (by running the NAK R package K times) on multiple synthetic datasets of simulated samples about brain regions. The data simulation strategy was suggested by (Bu & Lederer, 2017). As in NAK (Bu & Lederer, 2017), we use the spatial distance among brain regions as additional knowledge in JEEK.
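The core of a NAK-style weighted neighbourhood selection is a lasso regression with per-feature penalty weights. The following is an illustrative coordinate-descent sketch, not the NAK package itself; the objective normalization and all names are our own choices:

```python
import numpy as np

def weighted_lasso(X, y, weights, lam, n_iter=200):
    """Coordinate descent for a weighted lasso (NAK-style neighbourhood
    regression): min_b (1/2n)||y - Xb||^2 + lam * sum_j weights[j]*|b_j|.
    Larger weights (e.g., from larger spatial distances) shrink harder."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r / n
            # soft-threshold at the feature-specific level lam * weights[j]
            b[j] = np.sign(rho) * max(abs(rho) - lam * weights[j], 0.0) / col_sq[j]
    return b
```

Running one such regression per node column recovers the neighbourhood (support) of that node; a heavily weighted feature needs much stronger evidence to enter the model.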
W-SIMULE: (Singh et al., 2017)
Like JEEK, one recent multi-sGGM study (Singh et al., 2017) (following ideas from (Wang et al., 2017b)) also assumed Ω^{(i)} = Ω_I^{(i)} + Ω_S and incorporated spatial distance knowledge in its convex formulation for the joint discovery of heterogeneous neural connectivity graphs. This study, named W-SIMULE (Weighted model for Shared and Individual parts of MULtiple graphs Explicitly), uses a weighted constrained ℓ1 minimization:
(16)   argmin_{Ω_I, Ω_S}  Σ_{i=1}^{K} ‖W ∘ Ω_I^{(i)}‖_1 + ‖W ∘ Ω_S‖_1
Subject to:  ‖Σ̂^{(i)} (Ω_I^{(i)} + Ω_S) − I‖_∞ ≤ λ_n,  i = 1, ..., K
W-SIMULE simply includes the additional knowledge as a weight matrix W. (It can be solved by any linear programming solver and can be column-wise parallelized.) However, it is very slow when p is large due to its expensive computational cost.
Different from W-SIMULE, JEEK separates the knowledge of each individual context and the shared part using different weight matrices. While W-SIMULE also minimizes a weighted ℓ1 norm, its constrained optimization term is entirely different from JEEK's. This difference in formulation makes the optimization of JEEK much faster and more scalable than W-SIMULE (Section 6). We have provided a complete theoretical analysis of the error bounds of JEEK, while W-SIMULE provided no theoretical results. Empirically, we compare JEEK with the W-SIMULE R package from (Singh et al., 2017) in the experiments.
JGL: (Danaher et al., 2013)
Regularized-MLE-based multi-sGGM studies mostly follow the so-called joint graphical lasso (JGL) formulation of Eq. (17):
(17)   argmin_{Ω^{(i)} ≻ 0}  Σ_{i=1}^{K} ( −log det(Ω^{(i)}) + tr(Σ̂^{(i)} Ω^{(i)}) ) + λ_1 Σ_{i=1}^{K} ‖Ω^{(i)}‖_1 + λ_2 P(Ω^{(1)}, ..., Ω^{(K)})
P(·) is the second penalty function, enforcing some structural group assumption among the multiple graphs. One caveat of JGL is that P(·) cannot model explicit additional knowledge; for instance, it cannot incorporate information about a few known hub nodes shared by the contexts. In our experiments, we compare JEEK to the JGL-co-hub and JGL-perturb-hub toolbox provided by (Mohan et al., 2013).
FASJEM: (Wang et al., 2017a)
One very recent study extended JGL using a so-called elementary superposition-structured moment estimator formulation, Eq. (18):
(18) 
FASJEM is much faster and more scalable than the JGL estimators. However, like the JGL estimators, it cannot model additional knowledge, and its optimization needs to be carefully redesigned for each different P(·). (FASJEM extends JGL into multiple independent group-entry-wise optimizations, just like JEEK; the relevant quantity is the dual norm of P(·). Because (Wang et al., 2017a) only designed the optimization for two cases, the group-ℓ2 and group-ℓ∞ norms, we cannot use it as a baseline.)
Both NAK and W-SIMULE only explored the formulation for estimating neural connectivity graphs using spatial information as additional knowledge. Differently, our experiments (Section 6) extend the weight-as-knowledge formulation to weights encoding distance, shared-hub knowledge, perturbed-hub knowledge, and nodes' grouping information (e.g., multiple genes known to be in the same pathway). This largely extends the previous studies in showing the real-world adaptivity of the proposed formulation. JEEK elegantly formulates existing knowledge based on the problem at hand and avoids the need to design knowledge-specific optimizations.
6 Experiments
We empirically evaluate JEEK and baselines on four types of datasets, including two groups of synthetic data, one real-world fMRI dataset for brain connectivity estimation, and one real-world genomics dataset for estimating interactions among regulatory genes (results in Section 6.2). In order to incorporate various types of knowledge, we provide five different designs of the weight matrices in Section 12. Details of the experimental setup, metrics, and hyper-parameter tuning are included in Section 13.1. The baselines used in our experiments are explained in detail in Section 5. We also use JEEK with no additional knowledge (JEEK-NK) as a baseline.
JEEK is available as the R package ’jeek’ in CRAN.
6.1 Experiment: Simulated Samples with Known Hubs as Knowledge
Inspired by the JGL-co-hub and JGL-perturb-hub toolbox (JGL-node) provided by (Mohan et al., 2013), we empirically show JEEK's ability to model known co-hub or perturbed-hub nodes as knowledge when estimating multiple sGGMs. We generate multiple simulated Gaussian datasets through the random graph model (Rothman et al., 2008) to simulate both the co-hub and perturbed-hub graph structures (details in Section 14.1). We use the JGL-node package, W-SIMULE and JEEK-NK as baselines for this set of experiments. The weights in W_tot are designed using the strategy proposed in Section 12.
We use the AUC score (to reflect the consistency and variance of a method's performance when varying its important hyper-parameter) and the computational time cost to compare JEEK with the baselines. We compare all methods on many simulated cases by varying the dimension p and the number of tasks K. In Figure 2 and Figure 3(a)(b), JEEK consistently achieves higher AUC scores than the baselines JGL, JEEK-NK and W-SIMULE in all cases. JEEK is also many times faster than the baselines on average. In Figure 2, for the largest cases, W-SIMULE takes more than one month and JGL takes more than one day; therefore we cannot show them for large p.
6.2 Experiment: Gene Interaction Network from Real-World Genomics Data
Next, we apply JEEK and the baselines to one real-world biomedical dataset of gene expression profiles across two different cell types. We explored two different types of knowledge: (1) known edges and (2) known groups of genes. Figure 3(c) shows that JEEK has a lower time cost and recovers more interactions than the baselines (a higher number of edges matched to existing bio-databases). More results are in Appendix Section 14.2, and the design of the weight matrices for this case is in Section 12.
6.3 Experiment: Simulated Data about Brain Connectivity with Distance as Knowledge
Following (Bu & Lederer, 2017), we use known Euclidean distances between human brain regions as additional knowledge and use them to generate multiple simulated datasets (details in Section 14.3). We compare JEEK with the baselines regarding (a) scalability (computational time cost) and (b) effectiveness (F1-score, because the NAK package does not allow AUC calculation). For each simulation case, the computation time for each estimator is the sum of the method's execution time over all values of λ_n. Figure 4(a)(b) shows clearly that JEEK outperforms its baselines: JEEK has a consistently higher F1-score and is many times faster than W-SIMULE in the high-dimensional case. JEEK performs better than JEEK-NK, confirming the advantage of integrating additional distance knowledge. While NAK is fast, its F1-score is nearly zero and hence not useful for multi-sGGM structure learning.
6.4 Experiment: Functional Connectivity Estimation from RealWorld Brain fMRI Data
We evaluate JEEK and the relevant baselines on a classification task using one real-world, publicly available resting-state fMRI dataset: ABIDE (Di Martino et al., 2014). The ABIDE data aims to understand human brain connectivity and how it reflects neural disorders (Van Essen et al., 2013). ABIDE includes two groups of human subjects, autism and control, and therefore we formulate it as a two-context joint graph estimation problem. We utilize the spatial distance between human brain regions as additional knowledge for estimating functional connectivity edges among brain regions. We use Linear Discriminant Analysis (LDA) for a downstream classification task, aiming to assess the ability of a graph estimator to learn the differential patterns of the connectome structures. (Details of the ABIDE dataset, baselines, design of the additional knowledge matrix, cross-validation and LDA classification method are in Section 14.4.)
Figure 4(c) compares JEEK and three baselines: JEEK-NK, W-SIMULE and W-SIMULE with no additional knowledge (W-SIMULE-NK). JEEK yields a classification accuracy of 58.62% for distinguishing the autism subjects from the control subjects, clearly outperforming JEEK-NK and W-SIMULE-NK. JEEK is also many times faster than the W-SIMULE estimators, placing it in the top-left region of Figure 4(c) (higher classification accuracy and lower time cost). We also experimented with variations of the weight matrix and found the classification results fairly robust to these variations (Section 14.4).
7 Conclusions
We propose a novel method, JEEK, to incorporate additional knowledge in estimating multi-sGGMs. JEEK achieves the same asymptotic convergence rate as the state-of-the-art. Our experiments have showcased using weights to describe pairwise knowledge among brain regions, shared-hub knowledge, perturbed-hub knowledge, group information among nodes (e.g., genes known to be in the same pathway), and known interaction edges.
Acknowledgement: This work was supported by the National Science Foundation under NSF CAREER award No. 1453580. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.
Appendix:
8 More about Method
Notations: X^{(i)} is the data matrix for the i-th task, which includes n_i data samples described by p feature variables. Then n_tot = Σ_i n_i is the total number of data samples. We use the notation Ω for precision matrices and Σ̂ for estimated covariance matrices. Given a p-dimensional vector x, we denote the ℓ1 norm of x as ‖x‖_1 = Σ_j |x_j| and the ℓ∞ norm as ‖x‖_∞ = max_j |x_j|. Similarly, for a matrix A, let ‖A‖_1 = Σ_{j,k} |A_{j,k}| be the element-wise ℓ1 norm of A and ‖A‖_∞ = max_{j,k} |A_{j,k}| be its element-wise ℓ∞ norm.
8.1 More about Solving JEEK
8.2 JEEK is Group-Entry-Wise and Parallelizably Optimizable
JEEK can easily be parallelized: essentially, we just need to revise the "for loop" of steps 6 and 7 in Algorithm 1 into, for instance, "one entry per machine" or "one entry per core". We now prove that JEEK is group-entry-wise and parallelizably optimizable, i.e., that our estimator can be optimized asynchronously in a group-entry-wise manner. (JEEK is group-entry-wise optimizable) Suppose we use JEEK to infer multiple inverse covariance matrices summarized as Ω_tot. A group of entries describes the set of parameters at one (j,k) position. Varying j and k, we have a total of p^2 groups. If these groups are independently estimated by JEEK, then we have,
(19) 
Proof.
Eq. (13) defines the small linear programming subproblems on each group of entries. ∎
8.3 Extending JEEK with Structured Norms
We can add more flexibility to JEEK by adding structured norms, like the second regularization functions used in JGL. This extends JEEK to the following formulation:
(20) 
Here, the constraint needs to account for the additional structured norm. We propose two ways to solve Eq. (20). (1) The first is to use a parallelized proximal algorithm directly; however, this requires a closed-form proximity operator for the kw-norm, which has not been discovered. (2) In the second strategy, we treat each weighted norm in the objective of Eq. (20) as an independent regularizer; however, this increases the number of proximities we need to calculate per iteration. Both solutions make the extended JEEK algorithm less fast or less scalable. Therefore, we choose not to introduce this extension in this paper.
9 Connecting to the Bayesian statistics
Our approach has a close connection to a hierarchical Bayesian model. We show that the additional-knowledge weight matrices are also the parameters of the prior distribution of Ω_tot. In our formulation, Eq. (12), W_tot are the additional-knowledge weight matrices. From a hierarchical Bayesian view, the first level of the prior is a Gaussian distribution and the second level is a Laplace distribution. In the following, we show that the weights are also the parameters of the Laplace distributions that form the prior of Ω_tot.
Since, by definition, Ω^{(i)} = Ω_I^{(i)} + Ω_S and the individual and shared supports are disjoint, there are only two possible situations for each entry:
Case I ():
(21) 
(22) 
(23) 
Here the entry follows a Laplace distribution with mean 0, whose diversity (scale) parameter is the reciprocal of the corresponding weight. The larger the weight is, the more the distribution concentrates around 0; namely, there is higher density at 0.
Case II ():
(24) 
(25) 
(26) 
Here the entry again follows a Laplace distribution with mean 0, whose diversity (scale) parameter is the reciprocal of the corresponding weight. The larger the weight is, the more the distribution concentrates around 0; namely, there is higher density at 0.
Therefore, we can combine the above two cases into a single equation.
(27) 
Our final hierarchical Bayesian formulation consists of Eq. (21) and Eq. (27). This model is a generalization of the model considered in the seminal paper on the Bayesian lasso (Park & Casella, 2008). The weight parameters in our general model are hyper-parameters that specify the shape of the prior distribution of each edge in Ω_tot. The negative log-posterior distribution of Ω_tot is now given by:
(28) 
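The Laplace-prior interpretation can be checked numerically: a Laplace prior with scale 1/w contributes a penalty of w·|x| (plus a constant) to the negative log-posterior, matching the weighted ℓ1 terms of the kw-norm. A small illustrative sketch using scipy.stats (names are ours):

```python
import numpy as np
from scipy.stats import laplace

def neg_log_laplace(x, w):
    """Negative log-density of a zero-mean Laplace prior with scale 1/w:
    -log p(x) = log(2/w) + w*|x|."""
    return -laplace.logpdf(x, loc=0.0, scale=1.0 / w)

# Subtracting the value at 0 removes the constant log(2/w),
# leaving exactly the weighted-l1 penalty w*|x|.
x, w = 0.7, 3.0
penalty_part = neg_log_laplace(x, w) - neg_log_laplace(0.0, w)
```

Here penalty_part equals w·|x| = 2.1, illustrating that larger weights correspond to sharper priors around zero and hence stronger shrinkage of the corresponding edge.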
10 More about Theoretical Analysis
10.1 Theorems and Proofs of three properties of kwnorm
In this subsection, we prove the three properties of the kw-norm used in Section 3.2. We then provide the convergence rate of our estimator based on these three properties.

(i) The kw-norm is a norm function if and only if no entries of W_I and W_S equal 0.

(ii) If the condition in (i) holds, the kw-norm is a decomposable norm.

(iii) If the condition in (i) holds, the dual norm of the kw-norm is the weighted ℓ∞ norm obtained by element-wise division by the weights.
10.1.1 Norm:
10.1.2 Decomposable Norm:
Then we show that the kw-norm is a decomposable norm within a certain subspace. Before providing the theorem, we restate the structural assumption on the parameters.
(IS-Sparsity): The 'true' parameter Ω*_tot (the K multiple GGM structures) can be decomposed into two clear structures: Ω_I* and Ω_S*. Each Ω_I^{(i)*} is exactly sparse with nonzero entries indexed by a support set S_I^{(i)}, and Ω_S* is exactly sparse with nonzero entries indexed by a support set S_S, where S_I^{(i)} ∩ S_S = ∅. All other elements equal 0 (in the complement of the support sets). (IS-subspace)
(29) 
Eq. (11) is a decomposable norm with respect to this subspace pair.
10.1.3 Dual Norm of kwNorm:
10.1.4 Proof of Theorem (10.1.1)
For the kw-norm, the model subspace and its orthogonal complement are determined by the support sets defined above.
Proof.
If , then . Notice that . ∎
Proof.
To prove that the kw-norm is a norm, by Lemma (11.3) the only thing we need to prove is that each weighted ℓ1 term is a norm function when its weights are nonzero: (1) non-negativity; (2) absolute homogeneity; (3) the triangle inequality; (4) definiteness, i.e., if the weighted ℓ1 term equals 0 and the weights are nonzero, then the argument must be 0. Based on the above, each weighted ℓ1 term is a norm function. Since a summation of norms is still a norm function, the kw-norm is a norm function. ∎
Furthermore, we have the following lemma: the dual norm of a weighted ℓ1 norm is the correspondingly weighted ℓ∞ norm (element-wise division by the weights).
Proof.
(31)  
(32)  
(33) 
∎
10.1.5 Proof of Theorem (10.1.2)
Proof.
Assume the support-set decomposition above holds; then the kw-norm decomposes additively over the subspace pair. Therefore, the kw-norm is a decomposable norm with respect to the subspace pair. ∎
10.1.6 Proof of Theorem (10.1.3)
Proof.
Suppose the argument is decomposed according to the weight structure. Then the dual norm can be derived by the following equation.