High-dimensional Gaussian graphical model for network-linked data

07/04/2019, by Tianxi Li et al.

Graphical models are commonly used to represent conditional dependence relationships between variables. Many methods are available for estimating such relationships from high-dimensional data, but almost all of them rely on the assumption that the observations are independent and identically distributed. At the same time, observations connected by a network are becoming increasingly common, and tend to violate these assumptions. Here we develop a Gaussian graphical model for observations connected by a network with potentially different mean vectors, varying smoothly over the network. We propose an efficient estimation algorithm and demonstrate its effectiveness on both simulated and real data, obtaining meaningful interpretable results on a statistics coauthorship network. We also prove that our method estimates both the inverse covariance matrix and the corresponding graph structure correctly under the assumption of network "cohesion", which refers to the empirically observed phenomenon of network neighbors sharing similar traits.

Code repository: GNC (Gaussian graphical model with network cohesion)

1 Introduction

Network data represent information about relationships (edges) between units (nodes), such as friendships or collaborations, and are often collected together with more “traditional” covariates that describe one unit at a time. In a social network, edges may represent friendships between people (nodes), and traditional covariates could be their demographic characteristics such as gender, race, age, and so on. Incorporating relational information in statistical modeling tasks focused on “traditional” node covariates should improve performance, since it offers additional information, but most traditional multivariate analysis methods are not designed to use such information. In fact, most such methods for regression, clustering, density estimation and so on tend to assume the sampled units are homogeneous, typically independent and identically distributed (i.i.d.), which is unlikely to be the case for units connected by a network. While there is a fair amount of earlier work on incorporating such information into specific settings

(Manski, 1993; Lee, 2007; Yang et al., 2011; Raducanu and Dornaika, 2012; Vural and Guillemot, 2016), work on extending standard statistical methods to network-linked data has only recently started appearing, for example, Li et al. (2019) for regression, Tang et al. (2013) for classification, and Yang et al. (2013), Binkiewicz et al. (2017) for clustering. Our goal in this paper is to develop an analog of the widely used Gaussian graphical model for network-linked data, taking advantage of this additional information to improve performance when possible.

Graphical models are commonly used to represent independence relationships between random variables, with each variable corresponding to a node, and edges representing conditional or marginal dependence between the two random variables. Note that a graphical model is a graph connecting variables, as opposed to the networks discussed above, which are graphs connecting observations. Graphical models have been widely studied in statistics and machine learning and have applications in bioinformatics, text mining and causal inference, among others. The Gaussian graphical model belongs to the family of undirected graphical models, or Markov random fields, and assumes the variables are jointly Gaussian. Specifically, the conventional Gaussian graphical model for a data matrix $X \in \mathbb{R}^{n \times p}$ assumes that the rows $X_1, \dots, X_n$ are independently drawn from the same $p$-variate normal distribution $N(\mu, \Sigma)$. This vastly simplifies analysis, since for the Gaussian distribution all marginal dependence information is contained in the covariance matrix $\Sigma$, and all conditional independence information in its inverse. In particular, random variables $j$ and $j'$ are conditionally independent given the rest if and only if the $(j, j')$-th entry of the inverse covariance matrix $\Omega = \Sigma^{-1}$ (the precision matrix) is zero. Therefore estimating the graph for a Gaussian graphical model is equivalent to identifying zeros in the precision matrix, and this problem has been well studied, in both the low-dimensional and the high-dimensional settings. A pioneering paper by Meinshausen and Bühlmann (2006) proposed neighborhood selection, which learns edges by regressing each variable on all the others via lasso, and established good asymptotic properties in high dimensions. Many penalized likelihood methods have been proposed as well (Yuan and Lin, 2007; Banerjee et al., 2008; Rothman et al., 2008; d’Aspremont et al., 2008; Friedman et al., 2008). In particular, the graphical lasso (glasso) algorithm of Friedman et al. (2008) and its subsequent improvements (Witten et al., 2011; Hsieh et al., 2013a) are widely used to solve the problem efficiently.
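As a concrete, purely illustrative example of this estimation problem (our own sketch, not code from the paper), the following simulates i.i.d. Gaussian data from a sparse chain-structured precision matrix and recovers the conditional dependence graph with scikit-learn's GraphicalLasso; the precision matrix and the penalty value are assumptions made for the illustration.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Sparse precision matrix of a chain graph over 5 variables: 1-2-3-4-5
p = 5
Omega = np.eye(p)
for j in range(p - 1):
    Omega[j, j + 1] = Omega[j + 1, j] = 0.4

Sigma = np.linalg.inv(Omega)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=500)   # i.i.d. rows

# Penalized likelihood estimate of the precision matrix (graphical lasso)
Omega_hat = GraphicalLasso(alpha=0.05).fit(X).precision_

# Estimated conditional dependence graph = nonzero off-diagonal entries
edges = np.abs(Omega_hat) > 1e-4
np.fill_diagonal(edges, False)
print(np.argwhere(np.triu(edges)))   # recovered edges (i, j), i < j
```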

The penalized likelihood approach to Gaussian graphical models assumes the observations are i.i.d., a restrictive assumption in many real-world situations. This assumption was relaxed in Zhou et al. (2010); Guo et al. (2011) and Danaher et al. (2014) by allowing the covariance matrix to vary smoothly over time or across groups, while the mean vector remains constant. A special case of modeling the mean vector on additional covariates associated with each observation has also been studied (Rothman et al., 2010; Yin and Li, 2011; Lee and Liu, 2012; Cai et al., 2013; Lin et al., 2016). Neither of these relaxations is easy to adapt to network data, and their assumptions are hard to verify in practice.

In this paper, we consider the problem of estimating a graphical model with heterogeneous mean vectors when a network connecting the observations is available. For example, in analyzing word frequencies in research papers, the conditional dependencies between words may represent certain universal phrases used by all authors. However, since different authors also have different writing styles and research interests, there should be individual variation in word frequencies themselves, and the coauthorship information is clearly directly relevant to modeling both the universal dependency graph and the individual means. We propose a generalization of the Gaussian graphical model to such a setting, where each data point can have its own mean vector but the data points share the same covariance structure. We further assume that a network connecting the observations is available, and that the mean vectors exhibit network “cohesion”, a generic term describing the phenomenon of connected nodes behaving similarly, observed widely in empirical studies and experiments (Fujimoto and Valente, 2012; Haynie, 2001; Christakis and Fowler, 2007). We propose a computationally efficient algorithm to estimate the proposed Gaussian graphical model with network cohesion, and show that the method is consistent for estimating both the covariance matrix and the graph in high-dimensional settings under a network cohesion assumption. Simulation studies show that our method works as well as the standard Gaussian graphical model in the i.i.d. setting, and is effective in the setting of different means with network cohesion, while the standard Gaussian graphical model completely fails.

The rest of the paper is organized as follows. Section 2 introduces a Gaussian graphical model on network-linked observations and the corresponding two-stage model estimation procedure. An alternative estimation procedure based on joint likelihood is also introduced and we discuss why the two-stage estimation is preferable from both the computational and the theoretical perspective. Section 3 presents a formal definition of network cohesion and error bounds under the assumption of network cohesion and regularity conditions, showing we can consistently estimate the partial dependence graph and model parameters. Section 4 presents simulation studies comparing the proposed method to standard graphical lasso and the two-stage estimation algorithm to the joint likelihood approach. Section 5 applies the method to analyzing dependencies between terms from a collection of statistics papers’ titles and the associated coauthorship network. Section 6 concludes with discussion.

2 Gaussian graphical model with network cohesion

2.1 Preliminaries

We start with setting up notation. For a matrix $M$, let $M_{\cdot j}$ be the $j$th column and $M_{i \cdot}$ the $i$th row. By default, we treat all vectors as column vectors. Let $\|M\|_F$ be the Frobenius norm of $M$ and $\|M\|$ the spectral norm, i.e., the largest singular value of $M$. Further, let $\|M\|_0$ be the number of non-zero elements in $M$, $\|M\|_1 = \sum_{i,j}|M_{ij}|$, and $\|M\|_\infty = \max_{i,j}|M_{ij}|$. For a square matrix $M$, let $\operatorname{tr}(M)$ and $\det(M)$ be the trace and the determinant of $M$, respectively, and assuming $\Sigma$ is a covariance matrix, let $r(\Sigma) = \operatorname{tr}(\Sigma)/\|\Sigma\|$ be its stable rank. It is clear that $1 \le r(\Sigma) \le \operatorname{rank}(\Sigma)$ for any nonzero covariance matrix $\Sigma$.

While it is common, and not incorrect, to use the terms “network” and “graph” interchangeably, throughout this paper “network” is used to refer to the observed network connecting the observations, and “graph” refers to the conditional dependence graph of variables to be estimated. In a network or graph $G$ of size $n$, if two nodes $i$ and $j$ of $G$ are connected, we write $i \sim_G j$, or $i \sim j$ if $G$ is clear from the context. The adjacency matrix $A$ of a graph is an $n \times n$ matrix defined by $A_{ij} = 1$ if $i \sim j$ and 0 otherwise. We focus on undirected networks, which implies the adjacency matrix is symmetric. Given an adjacency matrix $A$, we define its Laplacian by $L = D - A$, where $D = \operatorname{diag}(d_1, \dots, d_n)$ and $d_i = \sum_j A_{ij}$ is the degree of node $i$. A well-known property of the Laplacian matrix is that, for any vector $v \in \mathbb{R}^n$, $v^\top L v = \sum_{i \sim j}(v_i - v_j)^2$, where the sum is over the edges of the network. We also define a standardized Laplacian $L_s = L/\bar{d}$, where $\bar{d}$ is the average degree of the network $G$, given by $\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i$. We denote the eigenvalues of $L_s$ by $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n = 0$, and their corresponding eigenvectors by $u_i$'s.
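As a small numerical illustration of these definitions (our own sketch, not code from the paper), the following builds $L$ and $L_s$ for a toy network and checks the quadratic-form identity above.

```python
import numpy as np

# Adjacency matrix of a small undirected network (a 4-cycle with one chord)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)

deg = A.sum(axis=1)              # degrees d_i
L = np.diag(deg) - A             # Laplacian L = D - A
L_s = L / deg.mean()             # standardized Laplacian L_s = L / dbar

# Check v' L v = sum over edges (v_i - v_j)^2
v = np.array([1.0, -2.0, 0.5, 3.0])
edges = np.argwhere(np.triu(A) > 0)              # each edge counted once
print(np.isclose(v @ L @ v, sum((v[i] - v[j]) ** 2 for i, j in edges)))

# Eigenvalues of L_s in decreasing order; the last one is 0
print(np.sort(np.linalg.eigvalsh(L_s))[::-1])
```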

2.2 Gaussian graphical model with network cohesion (GNC)

We now introduce the heterogeneous Gaussian graphical model, as a generalization of the standard Gaussian graphical model with i.i.d. observations. Assume the data matrix $X \in \mathbb{R}^{n \times p}$ contains independent observations $X_1, \dots, X_n$. Each $X_i \in \mathbb{R}^p$ is a random vector drawn from an individual multivariate Gaussian distribution

(1)  $X_i \sim N(\mu_i, \Sigma), \qquad i = 1, \dots, n,$

where $\mu_i$ is a $p$-dimensional vector and $\Sigma$ is a $p \times p$ symmetric positive definite matrix. Let $\Omega = \Sigma^{-1}$ be the corresponding precision matrix and $M = (\mu_1, \dots, \mu_n)^\top \in \mathbb{R}^{n \times p}$ be the mean matrix, which will eventually incorporate cohesion. Recall that in the Gaussian graphical model, $\Omega_{jj'} = 0$ corresponds to the conditional independence relationship between variables $j$ and $j'$ (Lauritzen, 1996). Therefore a typical assumption, especially in high-dimensional problems, is that $\Omega$ is a sparse matrix; this both allows us to estimate $\Omega$ when $p > n$, and produces a sparse conditional dependence graph.
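The role of zeros in $\Omega$ can be seen in a small worked example of our own: for a three-variable chain, variables 1 and 3 are marginally dependent but conditionally independent given variable 2, and the corresponding precision entry is exactly zero.

```python
import numpy as np

# Precision matrix of the chain graph 1 - 2 - 3: Omega[0, 2] = 0
Omega = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.5],
                  [0.0, 0.5, 1.0]])
Sigma = np.linalg.inv(Omega)

print(Sigma[0, 2])   # nonzero: variables 1 and 3 are marginally dependent
print(Omega[0, 2])   # zero: conditionally independent given variable 2

# Partial correlation of variables 1 and 3 given variable 2
print(-Omega[0, 2] / np.sqrt(Omega[0, 0] * Omega[2, 2]))   # 0.0
```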

Model (1) is much more flexible than the i.i.d. graphical model, and it separates co-variation caused by individual preference (cohesion in the mean) from universal co-occurrence (covariance). The price we pay for this flexibility is the much larger number of parameters, and model (1) cannot be fitted without additional assumptions on the mean, since we only have one observation for each mean vector $\mu_i$. The additional assumption we make in this paper is network cohesion: nodes that are connected in the observed network are likely to have similar mean vectors. Cohesion has often been observed empirically in social networks; for instance, in the coauthorship network example, cohesion indicates coauthors tend to have similar word preferences, which is reasonable since they work on similar topics and share at least some publications.

2.3 Fitting the GNC model

The log-likelihood of the data under model (1) is, up to a constant,

(2)  $\ell(M, \Omega) = \frac{n}{2}\log\det\Omega - \frac{1}{2}\sum_{i=1}^n (X_i - \mu_i)^\top \Omega (X_i - \mu_i).$

A sparse inverse covariance matrix and a cohesive mean matrix are naturally incorporated into the following two-stage procedure, which we call Gaussian graphical model estimation with Network Cohesion and lasso penalty (GNC-lasso). [Two-stage GNC-lasso algorithm] Input: a standardized data matrix $X$, network adjacency matrix $A$, tuning parameters $\alpha$ and $\lambda$.

  1. Mean estimation. Let $L_s$ be the standardized Laplacian of $A$. Estimate the mean matrix by

    (3)  $\hat{M} = \arg\min_{M \in \mathbb{R}^{n \times p}} \|X - M\|_F^2 + \alpha \operatorname{tr}(M^\top L_s M).$

  2. Covariance estimation. Let $\hat{\Sigma} = \frac{1}{n}(X - \hat{M})^\top (X - \hat{M})$ be the sample covariance matrix of $X$ based on $\hat{M}$. Estimate the precision matrix by

    (4)  $\hat{\Omega} = \arg\min_{\Omega \succ 0} \; \operatorname{tr}(\hat{\Sigma}\Omega) - \log\det\Omega + \lambda \sum_{j \ne j'} |\Omega_{jj'}|.$

The first step is a penalized least squares problem, where the penalty can be written as

$\operatorname{tr}(M^\top L_s M) = \frac{1}{\bar{d}} \sum_{i \sim i'} \|\mu_i - \mu_{i'}\|_2^2.$

This can be viewed as a vector version of the Laplacian penalty used in Li et al. (2019): it penalizes the difference between mean vectors of connected nodes, encouraging cohesion in the estimated mean matrix. Both terms in (3) are separable in the coordinates $j = 1, \dots, p$, and the least squares problem has a closed form solution,

(5)  $\hat{M}_{\cdot j} = (I + \alpha L_s)^{-1} X_{\cdot j}, \qquad j = 1, \dots, p.$

In practice, we usually need to compute the estimate for a sequence of $\alpha$ values, so we first calculate the eigen-decomposition of $L_s$ and then obtain each $\hat{M}$ in linear time. In most applications, networks are very sparse, hence by taking advantage of the sparsity and symmetric diagonal dominance of $L_s$, the eigen-decomposition can be computed very efficiently (Cohen et al., 2014). Given $\hat{M}$, criterion (4) is a graphical lasso problem that uses the lasso penalty (Tibshirani, 1996) to encourage sparsity in the estimated precision matrix; it can be solved efficiently by the glasso algorithm (Friedman et al., 2008) or its later improvements (Witten et al., 2011; Hsieh et al., 2013b, 2014). As a reminder, in the context of learning a graphical model, the primary parameter of interest is $\Omega$, not $M$, which is more of a nuisance parameter, or at least reflects individual effects rather than information about the population as a whole.
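A minimal sketch of the two-stage procedure is given below (our own illustration, not the authors' code): step 1 applies the closed form (5) to every column of $X$ via one eigendecomposition of $L_s$, and step 2 runs graphical lasso on the resulting covariance matrix, using scikit-learn's graphical_lasso as a stand-in for the glasso algorithm of Friedman et al. (2008).

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def gnc_lasso(X, A, alpha, lam):
    """Two-stage GNC-lasso sketch. X: n x p data matrix, A: n x n adjacency matrix."""
    n, p = X.shape
    deg = A.sum(axis=1)
    L_s = (np.diag(deg) - A) / deg.mean()       # standardized Laplacian

    # Step 1: mean estimation, M_hat[:, j] = (I + alpha * L_s)^{-1} X[:, j]
    evals, U = np.linalg.eigh(L_s)
    shrink = 1.0 / (1.0 + alpha * evals)        # spectral shrinkage factors
    M_hat = U @ (shrink[:, None] * (U.T @ X))

    # Step 2: graphical lasso on the covariance of the de-meaned data
    S_hat = (X - M_hat).T @ (X - M_hat) / n
    _, Omega_hat = graphical_lasso(S_hat, alpha=lam)
    return M_hat, Omega_hat
```

In an actual implementation the eigendecomposition would be hoisted out of the function, since reusing it is what makes computing $\hat{M}$ for a whole grid of $\alpha$ values cheap, as described above.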

2.4 An alternative: penalized joint likelihood

An alternative and seemingly more natural approach is to maximize a penalized log-likelihood to estimate both $M$ and $\Omega$ jointly, as

(6)  $(\hat{M}, \hat{\Omega}) = \arg\max_{M, \, \Omega \succ 0} \; \ell(M, \Omega) - \alpha \operatorname{tr}(M^\top L_s M) - \lambda \sum_{j \ne j'} |\Omega_{jj'}|,$

where $\ell(M, \Omega)$ is the log-likelihood in (2).

The objective function is bi-convex and the optimization problem can be solved by alternately optimizing over $M$ with $\Omega$ fixed and then optimizing over $\Omega$ with $M$ fixed until convergence. We refer to this method as iterative GNC-lasso. Though this strategy seems more principled in a sense, we implement our method with the two-stage algorithm, for the following reasons.

First, the computational complexity of the iterative method based on the joint likelihood is significantly higher, and it does not scale well in either $n$ or $p$. This is because when $\Omega$ is fixed and we need to maximize over $M$, all $np$ coordinates of $M$ are coupled in the objective function, so the scale of the problem is $np$. Even for moderate $n$ and $p$, solving this problem requires either a large amount of memory or applying Gauss-Seidel type algorithms that further increase the number of iterations. This problem is exacerbated by the need to select the two tuning parameters $\alpha$ and $\lambda$ jointly because, as we will discuss later, they are also coupled.

More importantly, our empirical results show that the iterative estimation method does not improve on the two-stage method. The same phenomenon was observed empirically by Yin and Li (2013) and Lin et al. (2016), who used a completely different approach of applying sparse regression to adjust the Gaussian graphical model, though those papers did not offer an explanation. We conjecture that this phenomenon of maximizing a penalized joint likelihood failing to improve on a two-stage method may be general. An intuitive explanation might lie in the fact that the two parameters $M$ and $\Omega$ are only connected through the penalty: the Gaussian log-likelihood (2) without a penalty is maximized over $M$ by $\hat{M} = X$, which does not depend on $\Omega$. Thus the likelihood itself does not pool information from different observations to estimate the mean (nor should it, since we assumed the means are different), while the cohesion penalty is separable in the variables and does not pool information between them either. An indirect justification of this conjecture follows from a property of the two-stage estimator stated in Proposition B in Appendix B, and the numerical results in Section 4 provide empirical support.

2.5 Model selection

There are two tuning parameters, $\alpha$ and $\lambda$, in the two-stage GNC-lasso algorithm. The parameter $\alpha$ controls the amount of cohesion over the network in the estimated mean and can be easily tuned based on predictive performance. In subsequent numerical examples, we always choose $\alpha$ from a sequence of candidate values by 10-fold cross-validation: in each fold, the sum of squared prediction errors on the validation set is computed, and the value of $\alpha$ minimizing the average prediction error is chosen. For computational efficiency, we can also use the generalized cross-validation (GCV) statistic to select $\alpha$, which was shown to be effective in theory for ridge-type regularization (Golub et al., 1979; Li, 1986). The GCV statistic for $\alpha$ is defined by

$\mathrm{GCV}(\alpha) = \frac{\|X - \hat{M}(\alpha)\|_F^2 / (np)}{\left(1 - \operatorname{tr}\big((I + \alpha L_s)^{-1}\big)/n\right)^2},$

where we write $\hat{M}(\alpha)$ to emphasize that the estimate depends on $\alpha$. The parameter $\alpha$ is then selected to minimize GCV.
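The sketch below evaluates this GCV criterion over a grid of positive $\alpha$ values using a single eigendecomposition of $L_s$ (our own helper; the smoother is $S_\alpha = (I + \alpha L_s)^{-1}$ applied column by column, matching the standard linear-smoother form of GCV used above).

```python
import numpy as np

def select_alpha_gcv(X, L_s, alphas):
    """Return the alpha (from a positive grid) minimizing the GCV criterion
    for the ridge-type smoother S_alpha = (I + alpha * L_s)^{-1}."""
    n, p = X.shape
    evals, U = np.linalg.eigh(L_s)
    UX = U.T @ X
    best_alpha, best_score = None, np.inf
    for a in alphas:
        shrink = 1.0 / (1.0 + a * evals)
        M_hat = U @ (shrink[:, None] * UX)
        rss = np.sum((X - M_hat) ** 2) / (n * p)
        df = shrink.sum()                     # tr(S_alpha): effective degrees of freedom
        score = rss / (1.0 - df / n) ** 2     # GCV(alpha)
        if score < best_score:
            best_alpha, best_score = a, score
    return best_alpha
```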

Given $\hat{\alpha}$, we obtain $\hat{M}$ and use the resulting sample covariance matrix $\hat{\Sigma}$ as the input of the glasso problem in (4); therefore $\lambda$ can be selected by standard glasso tuning methods, which may depend on the application. For example, we can tune $\lambda$ according to a data-driven goodness-of-fit criterion such as BIC, or via stability selection. Alternatively, if the graphical model is being fitted as an exploratory tool to obtain an interpretable dependence structure between variables, $\lambda$ can be selected to achieve a pre-defined sparsity level of the graph, or chosen subjectively to make the graph as interpretable as possible. Tuning $\lambda$ illustrates another important advantage of the two-stage estimation over the iterative method: when estimating the parameters jointly, the coupling of $\alpha$ and $\lambda$ means that tuning must be done on a grid of their values using the same tuning criterion. The de-coupling of the tuning parameters in the two-stage algorithm is both more flexible, since we can use different tuning criteria for each if desired, and more computationally tractable, since we only need to do two line searches instead of a two-dimensional grid search.

3 Theoretical properties

In this section, we investigate the theoretical properties of the two-stage GNC-lasso estimator. Throughout this section, we assume the observation network is connected, which implies that $L_s$ has exactly one zero eigenvalue. The results can be trivially extended to a network consisting of several connected components, either by assuming the same conditions for each component or by regularizing the network to be connected as in Amini et al. (2013). Recall that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n = 0$ are the eigenvalues of $L_s$, corresponding to eigenvectors $u_1, \dots, u_n$. Finally, for two positive quantities $a_n$ and $b_n$, we write $a_n \gtrsim b_n$ to mean $a_n \ge c\, b_n$ for some constant $c > 0$. All proofs are given in the Appendix.

3.1 Cohesion assumptions on the observation network

The first question we have to address is how to formalize the intuitive notion of cohesion. We start by defining cohesion of an arbitrary vector on a network, and then give a concrete example of a network and a corresponding mean matrix that satisfy the assumptions.

Intuitively, cohesion of an arbitrary vector $\mu \in \mathbb{R}^n$ on a network should imply that the penalty $\mu^\top L_s \mu$ is small in some sense. Equivalently, one could require $\|L_s \mu\|$ to be small instead, because $L_s \mu$ is the gradient of the cohesion penalty up to a constant and $\mu^\top L_s \mu \le \|\mu\|_2 \, \|L_s \mu\|_2$.

It turns out that defining cohesion with respect to $L_s \mu$ is easier for later derivations, so we will take this option. The vector $L_s \mu$ also has a nice interpretation: the $i$th coordinate of $L_s \mu$ is given by

$(L_s \mu)_i = \frac{d_i}{\bar{d}} \Big( \mu_i - \frac{1}{d_i} \sum_{i' \sim i} \mu_{i'} \Big),$

that is, $(L_s \mu)_i$ represents the difference between $\mu_i$ and the average of the neighbors of node $i$, after adjusting for its degree. Let $U \Lambda U^\top$ be the eigen-decomposition of $L_s$, and recall the eigenvalues are sorted in decreasing order. Any $\mu \in \mathbb{R}^n$ can be represented with a basis expansion $\mu = \sum_{i=1}^n b_i u_i$, where $b_i = u_i^\top \mu$. Under cohesion, we would expect the coefficients corresponding to large eigenvalues to be much smaller than those corresponding to small eigenvalues. We formalize this idea in the following definition. [Cohesive vector on a network] Given a network $G$ and a vector $\mu$, let $\mu = \sum_i b_i u_i$ be the expansion of $\mu$ in the basis of eigenvectors of $L_s$. We say $\mu$ is cohesive on $G$ with a given rate if, for all $i$,

(7)

which implies

An obvious but trivial example of a cohesive vector is a constant vector $\mu = c \mathbf{1}_n$ for some constant $c$, which is perfectly cohesive on any network. More generally, we define the class of trivially cohesive vectors: [Trivial cohesion] We say $\mu$ is trivially cohesive on a network $G$ if

where $\bar{\mu}$ is the projection of $\mu$ onto the subspace spanned by $u_n$, i.e., a constant vector with all coordinates equal to the average of the entries of $\mu$. We say $\mu$ is nontrivially cohesive if it is cohesive but not trivially cohesive.

To handle nontrivial cohesion, we need an additional assumption on network structure, so that model complexity can be effectively controlled by the network cohesion assumption.

Given the network adjacency matrix of size with eigenvalues of the standardized Laplacian , define the effective dimension of the network

Note that spectral graph theory (Brouwer and Haemers, 2011) implies for some constant , and thus for sufficiently large , we always have . For many sparse and/or structured networks, the effective dimension is much smaller than $n$, and for such networks nontrivially cohesive vectors exist.

Next, we give one such example: a lattice network whose effective dimension is much smaller than $n$. Assume $\sqrt{n}$ is an integer and define the lattice network of $n$ nodes by arranging them on a $\sqrt{n} \times \sqrt{n}$ spatial grid and putting an edge between nodes that are grid neighbors. All nodes have degree 2, 3, or 4 (2 at the corners, 3 along the sides of the lattice, and 4 for all internal nodes). [Cohesion on a lattice network] Assume $G$ is a lattice network on $n$ nodes, where $\sqrt{n}$ is an integer. Then for sufficiently large $n$,

  1. The effective dimension .

  2. There exist nontrivially cohesive vectors on the lattice network with rate .
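The lattice construction can be checked numerically with networkx (an illustration of ours, not the authors' code): the degrees are 2, 3, and 4, the standardized Laplacian has a single zero eigenvalue, and only a small fraction of its eigenvalues lies near zero, which is what leaves room for nontrivially cohesive vectors.

```python
import numpy as np
import networkx as nx

m = 20                               # grid side; n = m * m nodes
G = nx.grid_2d_graph(m, m)           # lattice: edges between grid neighbours

degrees = np.array([d for _, d in G.degree()])
print(sorted(set(degrees)))          # [2, 3, 4]: corners, sides, interior nodes

A = nx.to_numpy_array(G)
L_s = (np.diag(degrees) - A) / degrees.mean()    # standardized Laplacian

evals = np.sort(np.linalg.eigvalsh(L_s))
print(np.isclose(evals[0], 0.0))     # one zero eigenvalue (connected network)
print(np.mean(evals < 0.1))          # only a small fraction of eigenvalues near zero
```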

3.2 Mean estimation error bound

Our goal here is to obtain a bound on the difference between and the estimated obtained by Algorithm 2, under the following cohesion assumption on the columns of . Each column of the mean matrix is cohesive over the network with rate where is a positive constant. Moreover, for every for some positive constant .

[Mean error bound] Assume model (1) and Assumptions 3.1 - 3.2 hold. Write . Then estimated by (3) with satisfies

(8)

with probability at least

for some positive constant .

The theorem shows that the average estimation error is vanishing as long as the cohesive dimension , with high probability as long as grows with . Intuitively, we would expect grows with which in turn grows with . Next, we show that $\hat{M}$ is a sufficiently accurate estimate of $M$ to guarantee good properties of the estimate of the precision matrix in step 2 of the two-stage GNC-lasso, which is our primary target.

3.3 Inverse covariance estimation error bounds

We will need some additional notation and assumptions. The assumptions we need are the same as needed for glasso performance guarantees under the standard Gaussian graphical model (Rothman et al., 2008; Ravikumar et al., 2011).

Let $\Gamma = \Omega^{-1} \otimes \Omega^{-1}$ be the Fisher information matrix of the model, where $\otimes$ denotes the Kronecker product. In particular, under the multivariate Gaussian distribution, we have $\Gamma = \Sigma \otimes \Sigma$. Define the set of nonzero entries of $\Omega$ as

(9)  $S = \{(j, j') : \Omega_{jj'} \ne 0\}.$

We use $S^c$ to denote the complement of $S$. Let $s$ be the number of nonzero elements in $\Omega$. Recall that we assume all diagonals of $\Omega$ are nonzero. For any two sets $T, T'$, let $\Gamma_{T T'}$ denote the submatrix with rows and columns indexed by $T$, $T'$, respectively. When the context is clear, we may simply write for . Define

where the vector operator gives the number of nonzeros in the vector while the matrix norm gives the maximum norm of the rows.

Finally, by analogy to the well-known irrepresentability condition for the lasso, which is necessary and sufficient for lasso to recover support (Wainwright, 2009), we need an edge-level irrepresentability condition. There exists some such that

If we only want to obtain a Frobenius norm error bound, the following much weaker assumption is sufficient, without conditions on and Assumption 3.3: Let and be the minimum and maximum eigenvalues of , respectively. There exists a constant such that

Let $\hat{\Sigma} = \frac{1}{n}(X - \hat{M})^\top (X - \hat{M})$ be the sample covariance matrix based on $\hat{M}$. We use $\hat{\Sigma}$ as input for the glasso estimator (4). We would expect that if $\hat{M}$ is an accurate estimate of $M$, then $\Omega$ can be accurately estimated by glasso. The following theorem formalizes this intuition, using concentration properties of $\hat{\Sigma}$ around $\Sigma$ and the proof strategy of Ravikumar et al. (2011).

Under the conditions of Theorem 3.2 and Assumption 3.3, if and , there exist some positive constants that only depend on and , such that if is the output of Algorithm 2 with , where

(10)

and sufficiently large so that

then with probability at least , the estimate $\hat{\Omega}$ has the following properties:

  1. Error bounds:

  2. Support recovery:

    and if additionally then

As commonly assumed in the literature, for example in Ravikumar et al. (2011), we treat , and as constants or bounded quantities. The Frobenius norm bound does not need the strong irrepresentability assumption and does not depend on and . Following the proof strategy in Rothman et al. (2008), this bound can be obtained under the much weaker Assumption 3.3 instead.

Theorem 3.3 as written is difficult to interpret due to the large number of terms. We give two specific examples that provide further intuition. Consider the trivial cohesion case when . In that case the network dimension does not matter, and the bounds become standard glasso bounds from Ravikumar et al. (2011). Thus when the mean does not vary over the network, we do not lose anything in terms of rates by using GNC-lasso instead of glasso. In contrast, if cohesion on the mean vectors is non-trivial, e.g., , the first four terms in (10) do matter, and the requirements on the ratio will depend on . As an example, consider the case , which holds for lattice networks and path networks.

Under the assumption of Theorem 3.3, if for some constant , then all the results of Theorem 3.3 hold with

where is a constant that only depends on , and . In particular, when

the GNC-lasso estimate is consistent. Corollary 3.3 implies that for we will need . This is strictly stronger than the condition required for the i.i.d. Gaussian data (Rothman et al., 2008; Ravikumar et al., 2011). This is not surprising, since we now have many more parameters to estimate due to the different mean vectors, but leveraging cohesion still allows for consistency under a reasonable set of conditions on , , and .

4 Simulation studies

In this section, we investigate the performance of the proposed method in various network cohesion settings. The observation network in our simulation study is a lattice network with nodes. At each node, we observe a random vector with dimension .

Gaussian noise settings:

The conditional dependence graph in the Gaussian graphical model is generated as an Erdős-Rényi graph on nodes, with each node pair connecting independently with probability 0.01. The Gaussian noise is then drawn from where , where is the adjacency matrix of , is the absolute value of the smallest eigenvalue of and the scalar is set to ensure the resulting has all diagonal elements equal to 1. This procedure is implemented in Zhao et al. (2012).
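One common construction of this type can be sketched as follows (the exact edge weights and diagonal shift used in the paper were lost in extraction, so the constants below are illustrative assumptions): generate an Erdős-Rényi graph, make the weighted adjacency matrix positive definite by shifting its diagonal, and rescale the implied covariance matrix to unit diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 100                                          # number of variables (illustrative)

# Erdos-Renyi conditional dependence graph with edge probability 0.01
upper = np.triu(rng.random((p, p)) < 0.01, k=1).astype(float)
A_graph = upper + upper.T

# Weight the edges and shift the diagonal past the most negative eigenvalue
# to make the matrix positive definite (0.3 and 0.2 are illustrative constants)
B = 0.3 * A_graph
Omega = B + (np.abs(np.linalg.eigvalsh(B).min()) + 0.2) * np.eye(p)

# Rescale so that the covariance matrix has unit diagonal
Sigma = np.linalg.inv(Omega)
d = np.sqrt(np.diag(Sigma))
Sigma = Sigma / np.outer(d, d)
Omega = np.linalg.inv(Sigma)

noise = rng.multivariate_normal(np.zeros(p), Sigma, size=400)   # Gaussian noise
```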

Mean vector settings:

To investigate how the method adapts to different degrees of cohesion, we construct each row as

where and are the th and th eigenvectors of and is the mixing proportion. We then rescale so that , which makes the signal-to-noise ratio equal to 1. The constant vector is perfectly but trivially cohesive; thus a mixing proportion of zero gives identical mean vectors for all observations, and as the proportion increases, the mean vectors of different observations become more different. We consider , which corresponds to the quantity taking values , , , , respectively.
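The exact eigenvector indices and normalization in this construction were lost in extraction, so the sketch below reflects only our reading of it: each node's mean mixes the constant eigenvector of $L_s$ with a smooth non-constant eigenvector using the mixing proportion, and the mean matrix is rescaled so the average squared mean entry equals the (unit) noise variance.

```python
import numpy as np

def cohesive_means(L_s, p, phi, rng):
    """n x p mean matrix mixing the constant eigenvector of L_s with a smooth
    non-constant eigenvector (mixing proportion phi). The eigenvector choice
    and the scaling rule are illustrative assumptions."""
    n = L_s.shape[0]
    evals, U = np.linalg.eigh(L_s)       # ascending; U[:, 0] is the constant eigenvector
    direction = (1.0 - phi) * U[:, 0] + phi * U[:, 1]

    # Every variable follows the same cohesive pattern over the network,
    # up to a variable-specific magnitude
    M = np.outer(direction, rng.normal(size=p))

    # Rescale so the average squared mean entry is 1 (signal-to-noise ratio 1,
    # since the noise covariance has unit diagonal)
    return M * np.sqrt(n * p / np.sum(M ** 2))
```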

We evaluate performance on recovering the true underlying graph, as measured by the receiver operating characteristic (ROC) curve, along a graph estimation path obtained by varying $\lambda$. An ROC curve illustrates the tradeoff between the true positive rate (TPR) and the false positive rate (FPR), defined as

TPR = (number of correctly identified edges) / (number of true edges),
FPR = (number of incorrectly identified edges) / (number of true non-edges).
We start by comparing GNC-lasso to standard glasso, which does not use the network. Three versions of GNC-lasso are reported: the oracle GNC-lasso, the CV-tuned GNC-lasso, and the GCV-tuned GNC-lasso. The oracle uses the value of $\alpha$ giving the ROC curve with the highest AUC (area under the curve); it is unknown in practice but provides a benchmark of the best possible performance. The CV-tuned GNC-lasso uses $\alpha$ chosen by 10-fold cross-validation, which is our recommendation in practice. Finally, we also include a version with $\alpha$ chosen by GCV, which is computationally more efficient than 10-fold CV.

Figure 1: Graph recovery ROC curves under four different degrees of cohesion (panels (a)-(d) correspond to the 1st-4th mean configurations). Regular glasso fails as soon as there is any heterogeneity in the mean vectors, while GNC-lasso works equally well for any amount of cohesion.

Figure 1 shows the ROC curves of the four methods obtained from 100 independent replications. In the i.i.d. setting (when the mixing proportion is zero), both methods are effective at recovering the graph structure. The optimal GNC-lasso is almost identical to the glasso estimate, as expected. The practical version of GNC-lasso (tuned by 10-fold CV) is only slightly inferior, due to the noise introduced by the cross-validation procedure. This example shows that even in the i.i.d. setting, using GNC-lasso sacrifices little compared to glasso. As the mixing proportion increases, glasso begins to fail quickly. With even a small deviation from the trivial cohesion setting, GNC-lasso still maintains similar performance while glasso becomes close to the random guess line (TPR = FPR). When the mean vectors are completely orthogonal to the trivial direction, GNC-lasso maintains its effectiveness, while glasso identifies almost no true edges in the early part of the path.

Next, we compare the estimator obtained by iteratively and jointly optimizing (6) over $M$ and $\Omega$ to the two-stage estimator. We use the same four data generating mechanisms as before. The optimal GNC-lasso (with the $\alpha$ giving the ROC curve with the best AUC) and the GNC-lasso with $\alpha$ selected by 10-fold cross-validation are plotted, and we also include the ROC curve from the iterative estimation with its optimal tuning (again, giving the best AUC). We do not include a practical version of the iterative method since it is not clear how to tune it effectively in a computationally feasible way. The results are shown in Figure 2. The optimal GNC-lasso and the optimal iterative GNC-lasso give almost identical ROC curves, indicating that the significantly larger computational burden of the iterative estimate does not bring a performance improvement. Since two-stage GNC-lasso can also be tuned effectively by cross-validation or GCV, it is clearly the preferable and practical approach.

Figure 2: Conditional dependence graph selection ROC curves for the CV-tuned GNC-lasso estimate, the optimal GNC-lasso estimate, and the optimal iterative GNC-lasso estimate under the 3rd and 4th configurations in Figure 1.

5 Data analysis: learning associations between statistical terms

Here we apply the proposed method to the dataset of papers from 2003-2012 from four statistical journals collected by Ji and Jin (2016). The dataset contains full bibliographical information for each paper and was curated for disambiguation of author names when necessary. Our goal is to learn a conditional dependence graph between terms in paper titles, with the aid of the coauthorship network.

Figure 3: The coauthorship network of 635 statisticians (after pre-processing). The size and the color of each node correspond to its degree (larger and darker circles have more connections).

We pre-processed the data by removing authors who have only one paper in the data set, and filtering out common stop words (“and”, “the”, etc.) as well as terms that appear in fewer than 10 paper titles. We then calculated each author’s average term frequency across all papers on which he/she is a coauthor. Two authors are connected in the coauthorship network if they have co-authored at least one paper, and we focus on the largest connected component of the network. Finally, we sorted the terms according to their term frequency-inverse document frequency (tf-idf) score, one of the most commonly used measures in natural language processing to assess how informative a term is (Leskovec et al., 2014), and kept the 300 terms with the highest tf-idf scores. After all pre-processing, we have $n = 635$ authors and $p = 300$ terms. The observations are 300-dimensional vectors recording the average frequency of term usage for each author. The coauthorship network is shown in Figure 3.
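This preprocessing can be sketched schematically as follows (our own sketch, not the authors' code; the record fields `authors` and `title_terms` and the simplified tf-idf score are hypothetical).

```python
import numpy as np
import networkx as nx
from collections import Counter
from itertools import combinations

def preprocess(papers, min_docs=10, n_terms=300):
    """papers: list of dicts like {"authors": [...], "title_terms": [...]} (hypothetical format)."""
    # Coauthorship network: connect authors who share at least one paper,
    # then keep the largest connected component
    G = nx.Graph()
    for paper in papers:
        G.add_nodes_from(paper["authors"])
        G.add_edges_from(combinations(paper["authors"], 2))
    authors = sorted(max(nx.connected_components(G), key=len))
    G = G.subgraph(authors)

    # Rank terms by a simplified tf-idf style score and keep the top n_terms
    n_docs = len(papers)
    tf = Counter(t for paper in papers for t in paper["title_terms"])
    df = Counter(t for paper in papers for t in set(paper["title_terms"]))
    candidates = [t for t in df if df[t] >= min_docs]
    candidates.sort(key=lambda t: -tf[t] * np.log(n_docs / df[t]))
    terms = candidates[:n_terms]
    term_idx = {t: j for j, t in enumerate(terms)}

    # Per-author average term frequency across the author's papers
    X = np.zeros((len(authors), len(terms)))
    for i, author in enumerate(authors):
        docs = [paper for paper in papers if author in paper["authors"]]
        for paper in docs:
            for t in paper["title_terms"]:
                if t in term_idx:
                    X[i, term_idx[t]] += 1
        X[i] /= max(len(docs), 1)

    return X, nx.to_numpy_array(G, nodelist=authors), authors, terms
```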

The interpretation in this setting is very natural: taking coauthorship into account makes sense when estimating the conditional dependence graph, since the terms come from shared paper titles. We can expect that there will be standard phrases that are fairly universal (e.g., “confidence intervals”), as well as phrases specific to relatively small groups of authors with multiple connections, corresponding to specific research areas (e.g., “principal components”), which is exactly the scenario where our model should be especially useful relative to the standard Gaussian graphical model.

To ensure comparable scales for both columns and rows, we standardize the data using the successive normalization procedure introduced by Olshen and Rajaratnam (2010). If we select the tuning parameters using 10-fold cross-validation, as before, the graphs from GNC-lasso and glasso recover 4 and 6 edges, respectively, which are very sparse graphs. To keep the graphs comparable and to allow for more interpretable results, we instead set the number of edges to 25 for both methods and compare the resulting graphs, shown in Figure 4 (glasso) and Figure 5 (GNC-lasso). For visualization purposes, we only plot the 55 terms that have at least one edge in at least one of the two graphs.

Overall, most edges recovered by both methods represent common phrases in the statistics literature, including “exponential families”, “confidence intervals”, “measurement error”, “least absolute” (deviation), “probabilistic forecasting”, and “false discovery”. There are many more common phrases that are recovered by GNC-lasso but missed by glasso, for example, “high dimension(al/s)”, “gene expression”, “covariance matri(x/ces)”, “partially linear”, “maximum likelihood”, “empirical likelihood”, “estimating equations”, “confidence bands”, “accelerated failure” (time model), “principal components” and “proportional hazards”. There are also a few that are found by glasso but missed by GNC-lasso, for example, “moving average” and “computer experiments”. Some edges also seem like potential false positives, for example, the links between “computer experiments” and “orthogonal construction”, or the edge between “moving average” and “least absolute”, both found by glasso but not GNC-lasso.

Figure 4: Partial correlation graph estimated by glasso.

Figure 5: Partial correlation graph estimated by GNC-lasso.

Figure 6: Projection of the 55 terms into two dimensions by MDS.

Additional insights about the data can be drawn from the mean matrix $\hat{M}$ estimated by GNC-lasso; glasso does not provide any information about the means. Each column of $\hat{M}$ can be viewed as the vector of authors’ preferences for a given term, so we can visualize the relative distances between terms as reflected in their popularity. Figure 6 shows the 55 terms from Figure 5, projected down to two dimensions for visualization purposes by multidimensional scaling (MDS) (Mardia, 1978). The visualization shows a clearly outlying cluster, consisting of the terms “computer”, “experiments”, “construction”, and “orthogonal”, and to a lesser extent the cluster “Markov Chain Monte Carlo” is also further away from all the other terms. The clearly outlying group can be traced back to a single paper, with the title “Optimal and orthogonal Latin hypercube designs for computer experiments” (Butler, 2001), which is the only title in which the words “orthogonal” and “experiments” appear together. Note that glasso estimated them as a connected component in the graph, whereas GNC-lasso did not, since it was able to separate a one-off combination occurring in a single paper from a common phrase. This illustrates the advantage of GNC-lasso’s ability to distinguish individual variation in the mean vectors from the overall dependence patterns, an ability that glasso lacks.
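The term map can be reproduced schematically with scikit-learn's MDS applied to the columns of $\hat{M}$ (a sketch of ours; `M_hat` and `terms` are assumed to come from the earlier estimation and preprocessing steps).

```python
from sklearn.manifold import MDS

def project_terms(M_hat, terms, seed=0):
    """2-D MDS map of terms: each column of M_hat (the authors' preferences
    for one term) is treated as that term's feature vector."""
    coords = MDS(n_components=2, random_state=seed).fit_transform(M_hat.T)
    return dict(zip(terms, coords))
```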

6 Discussion

We have extended the standard graphical lasso problem and the corresponding estimation algorithm to the more general setting in which each observation can have its own mean vector. We studied the case of observations connected by a network and leveraged the empirically known phenomenon of network cohesion to share information across observations, so that we can still estimate the means in spite of having $np$ mean parameters instead of just $p$ as in the standard setting. The main object of interest is the inverse covariance matrix, which represents universal dependencies in the population; all observations share the same covariance matrix, while their means are allowed to vary under the assumption of network cohesion. The method is computationally efficient, with theoretical guarantees on the estimated inverse covariance matrix and the corresponding graph. Both simulations and an application to a coauthorship network show that GNC-lasso is more accurate and gives more insight into the structure of the data than the standard glasso when observations are connected by a network. One possible avenue for future work is obtaining a stronger consistency result, which will likely involve penalizing the mean vectors and making additional structural assumptions. The absolute deviation penalty (Hallac et al., 2015) between connected nodes is a possible alternative, if the computational cost issue can be resolved through an efficient optimization approach. Another direction is to consider the case where the partial dependence graphs themselves differ across individuals over the network, but in a cohesive fashion; the case of jointly estimating several related graphs has been studied by Guo et al. (2011) and Danaher et al. (2014). As always, making the model more general involves a trade-off between goodness of fit and parsimony, which may be elucidated by obtaining convergence rates in this setting.

Acknowledgements

E. Levina and T. Li (while a PhD student at the University of Michigan) were supported in part by an NSF grant (DMS-1521551) and an ONR grant (N000141612910). J. Zhu and T. Li (while a PhD student at the University of Michigan) were supported in part by NSF grants (DMS-1407698 and DMS-1821243). C. Qian was supported in part by the National Key R&D Program of China (2018YFF0215500).

Appendix A Proofs

First, recall the following matrix norm definitions we’ll need: for any matrix , , , and

The following lemma summarizes a few concentration inequalities that we will need.

[Concentration of norm of a multivariate Gaussian] For a Gaussian random vector , with a positive definite matrix and the largest eigenvalue of , we have,

(11)
(12)
(13)

for some generic constant . Further, if are i.i.d. observations from , then

(14)

where is the stable rank of .

Proof of Lemma A.

The first inequality (11) follows from concentration of a Lipschitz function of a sub-Gaussian random vector. Inequalities (12) and (13) follow from the definition of a sub-exponential random variable. Lastly, (14) follows from applying Bernstein’s inequality to (12) with . ∎

Proof of Proposition 3.1.

By Edwards (2013), the eigenvalues of are given by

(15)

Since the average degree for a lattice network, we ignore this constant. First, we show , which by definition of is equivalent to . Define the set of all eigenvalues satisfying this condition as

Then it is sufficient to show Applying the inequality for , we can see that it is sufficient to show , where

The cardinality of can be computed exactly by counting; for simplicity, we give an approximate calculation for when is sufficiently large. In this case the proportion of pairs out of the entire set of that satisfy the condition to be included in can be upper bounded by twice the ratio between the area of the quarter circle with radius and the area of the square. This gives

To prove the second claim, consider the such that all the inequalities in (7) hold as equalities and . Then, by noting that , we have

(16)

We need a lower bound for . By (15),

Substituting this lower bound for in (16), for a sufficiently large we have

Therefore, the we constructed is nontrivially cohesive.

We can represent each column of by taking the basis expansion in , obtaining the basis coefficient matrix such that . Let , where is the estimate (3). We can view as an estimate of . We first state the error bound for in Lemma A, and the bound for directly follows.

Under model (1) and Assumptions 3.1 - 3.2, if , we have

(17)

with probability at least for some constants , , , , and .
In Frobenius norm, we have

(18)

with probability at least .
Finally, if and , then

(19)

with probability at least .

Proof of Lemma A.

Solving (3), we can explicitly write out $\hat{M} = (I + \alpha L_s)^{-1} X$.

In particular, for each column , the estimate can be written as

where . Let and be two dimensional vectors such that the th element of is given by while the th element of is given by