
Graph Compression with Application to Model Selection

by   Mojtaba Abolfazli, et al.

Many multivariate data, such as social and biological data, exhibit complex dependencies that are best characterized by graphs. Unlike sequential data, graphs are, in general, unordered structures, so classic sequence-based compression methods cannot be applied directly to graph-based data. It is therefore necessary to develop new methods for graph compression. In this paper, we present universal source coding methods for the lossless compression of unweighted, undirected, unlabeled graphs. We encode in two steps: 1) transforming the graph into a rooted binary tree; 2) encoding the rooted binary tree using graph statistics. Our coders show better compression performance than other source coding methods on both synthetic and real-world graphs. We then apply our graph coding methods to model selection of Gaussian graphical models using the minimum description length (MDL) principle, by finding the description length of the conditional independence graph. Experiments on synthetic data show that our approach gives better performance than common model selection methods. We also apply our approach to electrocardiogram (ECG) data in order to explore the differences between the graph models of two groups of subjects.





I Introduction

Graphs can be used to represent complex dependencies between many variables. Graph-based analyses are used in many disciplines, such as social network analysis, natural language processing, chemistry, and bioinformatics. Real-world graphs can be large and expensive to store and communicate, so developing efficient methods to compress graphical data is of interest for storage and transmission. Classic information-theoretic methods for coding deal with sequential data. Unlike sequential data, which have starting and stopping points (an ordering), graphs are unordered structures. Since ordering is essential in coding, graph coding is challenging: graphs cannot be efficiently compressed using traditional, sequence-based compression methods.

In this paper, we first present new universal source coding methods for the lossless compression of unlabeled, unweighted, undirected, simple graphs. Since the graphs are unlabeled, vertices do not have distinct identifications except through their interconnections. Reference [1] referred to such a graph as a graph structure. Our approach has two steps. First, inspired by Steinruecken’s method for coding unordered i.i.d. sequences [2], we transform a graph G into an equivalent rooted binary tree T. Then, we use graph statistics from G to develop two classes of coders for T. The first class utilizes local properties (i.e., the formation of graph motifs such as triangles), and the second class uses the degree distribution, a global graph statistic, along with local properties for encoding. This step uses past information (encoded vertices) to encode the connections at the current vertex. In this way we can build a probabilistic model based on graph statistics (either by learning or by estimating statistics) to encode graphs. Our coders reflect more information about the structure of G and therefore give shorter codelengths than the structural coding of [1] for graphs with more structure than Erdös-Rényi graphs.

In the second half of the paper, we use graph coding for data analysis via description length, the number of bits needed to describe data. Rissanen proposed using description length for model selection in his work on the minimum description length (MDL) principle [3, 4, 5]. Rissanen’s MDL principle codifies model selection by balancing how well a model describes the given data against the complexity of the model. Here, we develop a novel approach for model selection of Gaussian graphical models. We compute the description length of the conditional independence graph (the sparsity pattern of the precision matrix) using our lossless graph coding methods, and the description length of the data under the assumed Gaussian distribution. The model that minimizes the sum of these two terms is selected as the best model. Unlike other methods that may only consider the number of edges in the conditional independence graph to account for model complexity, our approach considers the whole structure of the conditional independence graph through lossless compression.

We show, using synthetic and real-world data, the advantages of our methods for both compression and Gaussian graphical model selection. Our approach outperforms competing source coding methods in two different scenarios: 1) compression of a single real-world graph; 2) compression of a graph of a particular type after learning the statistics of that type through training. We also compare our approach with common methods in the literature for model selection in Gaussian graphical models. Experimental results on synthetic data show that our approach recovers the true graph model of the data with a higher F1-score. We also consider a real-world dataset containing features extracted from 12-lead electrocardiogram (ECG) signals of a group of healthy people and a group of people with Kawasaki disease, and observe a difference between the graph models of the two groups.

The paper is organized as follows. First, we provide an overview of previous work on universal compression methods for graphs in Section I-A. In Section II, we describe our approach for transforming a graph structure into a rooted binary tree and show that the compression of the rooted binary tree is equivalent to the compression of the graph structure. We then introduce two classes of universal coders based on graph statistics and provide experimental results on synthetic and real-world graphs. In Section III, we present the application of graph coding to model selection in Gaussian graphical models, give performance results on synthetic data, and then apply our approach to ECG data. We make concluding remarks and give directions for future work in Section IV.

I-A Prior Work

Graph compression is a relatively new area in source coding. References [6, 7, 8, 9] focused on the entropy analysis of graph compression. Other papers have provided practical graph compression algorithms. These algorithms can be designed for the compression of either unlabeled graphs (i.e., graph structures) or labeled graphs (i.e., encoding vertex labels together with the graph structure). In the case of an unlabeled graph, the decoder recovers a graph that is isomorphic to the input graph. In the case of a labeled graph, the decoder recovers the exact graph (i.e., the graph structure with labels) at the expense of longer codewords. In other words, the encoding of unlabeled graphs benefits from isomorphism, since different labeled graphs can share the same structure. Fewer bits are required, in general, to encode an unlabeled graph than a labeled graph with the same structure [1].

Reference [1] was the first study to develop a compression algorithm that reaches the entropy of the distribution on unlabeled Erdös-Rényi graphs up to the first two terms. In [10], two universal coding methods for arbitrary unlabeled graphs, based on degree distribution and the formation of triangles, were introduced. The authors in [11] presented asymptotically optimal structural compression algorithms for both unlabeled and labeled preferential attachment graphs. The compression of dynamic graphs generated by the duplication model was studied in [12], where algorithms for both unlabeled and labeled versions of such graphs were developed. Reference [13] introduced an algorithm for the compression of deep feedforward neural networks by modeling them as bipartite graph layers. An algorithm for the compression of sparse labeled graphs with marks on vertices and edges was introduced in [14]. A general survey on lossless graph compression methods can be found in [15].

Previous compression methods are tailored to specific graph models and can perform poorly on other graph models or on real-world graphs. Our approach provides a method to encode arbitrary unlabeled graph structures by building a probabilistic model based on graph properties (e.g., graph motifs and degree distribution). This general approach extracts more information from the graph structure than prior work and results in shorter codelengths compared to other source coding methods.

The literature on model selection methods for Gaussian graphical models is rich. Existing methods fall broadly into two main classes: information-based methods and resampling methods. Information-based methods, such as the Bayesian information criterion (BIC) [16], the Akaike information criterion (AIC) [17], and the extended Bayesian information criterion (EBIC) [18], select the best model based on the log-likelihood of the data and the complexity of the Gaussian graphical model. To account for model complexity, these methods typically consider only the number of edges of the conditional independence graph induced by the Gaussian graphical model; while easy to obtain, this statistic gives only a rough approximation of the graph’s structure. In contrast, our approach to model selection chooses the best model based on the description length of the conditional independence graph and of the data when encoded with the resulting conditional independence graph [19].

Resampling methods measure performance on out-of-sample data by splitting the data into a subset of samples for fitting and using the remaining samples to estimate the efficacy of the model. The most common method in this class is cross-validation (CV) [20]. Other methods are Generalized Approximate Cross-Validation (GACV) [21], the Rotation Information Criterion (RIC) [22], and the Stability Approach to Regularization Selection (StARS) [23]. The major shortcoming of resampling methods is their high computational cost, since they require solving the problem across all sets of subsamples.

A key aspect of model selection in Gaussian graphical models is the structure of the conditional independence graph. Resampling methods overlook this aspect, while information-based methods only consider simple graph statistics such as the number of edges. Our approach presents a different perspective that accounts for model complexity more accurately. It relies on being able to compute the description length of the conditional independence graph using our lossless graph compression methods.

II Graph Compression

Consider an unweighted, undirected, simple, unlabeled graph G = (V, E), where V is the set of vertices and E is the set of edges. In this paper, we present a method to compress a graph structure inspired by Steinruecken’s method for coding unordered i.i.d. binary sequences [2]. We use the terms compression and coding interchangeably, as they denote the same process. One can see our methods as extensions of [1] that take more graph structure into account. The general idea is to randomly pick a vertex v and encode information about all the incident edges of v. A neighbor of v is then picked, and the information about its incident edges is encoded. To implement this scheme, an unlabeled graph G is transformed into a rooted binary tree T. Each level of T is associated with one vertex of G to be encoded. An encoder is designed to encode T. A decoder can then decode T without error and recover a graph that is identical to G up to an automorphism of the nodes. In this paper, we introduce two broad classes of graph coders to efficiently encode the rooted binary tree T.

II-A Definitions and Notation

Let v_i denote the i-th vertex in G. We will exclusively use vertices to refer to the nodes in G and nodes to refer to the nodes in T. The total number of vertices in G is n. Let e_ij denote an edge between vertices v_i and v_j in G. The degree of vertex v_i is the total number of vertices connected to v_i. The degree distribution P(d) is the probability distribution of the degrees of the vertices in G and is an often-used statistic to differentiate between classes of random graphs. The adjacency matrix of G is denoted by A, an n × n matrix where A_ij = 1 if there is an edge between v_i and v_j in G.

The structure of the rooted binary tree T is very important in the encoding process. We divide the organization of T into levels (also known as depths). Let r_{l,m} denote the m-th node of the l-th level of T; the root node is r_{0,1}. Figure 2 shows our naming convention for all the other nodes. A node in T can contain multiple vertices from G. The cardinality of a node r_{l,m} is denoted |r_{l,m}|; by default, the root has cardinality n.

Definition 1.

A node r_{l,m} is a left node if it is the left child of a node in level l − 1. In our convention, left nodes have an odd index m.


Definition 2.

A node r_{l,m} is a right node if it is the right child of a node in level l − 1. In our convention, right nodes have an even index m.

Definition 3.

For a level l, the first nonempty node is the node r_{l,m} with the smallest index m such that |r_{l,m}| > 0.


Oftentimes, we wish to explicitly refer to the parent or children of a node. We use pa(r_{l,m}) to refer to the parent node of r_{l,m}, and left(r_{l,m}) and right(r_{l,m}) to refer to the left and right children of r_{l,m}. Figure 2 illustrates these relations for the example tree.

Let P(r_{l,m}) denote the path from node r_{l,m} to the root node. For a left node r_{l,m}, we find the levels at which P(r_{l,m}) and the path of the first nonempty node at level l both contain left nodes, and store those levels in a set A(r_{l,m}). We also find the levels at which either of these paths contains a left node and store those levels in a set B(r_{l,m}).

For example, in Figure 2 one can identify levels at which both paths contain left nodes, which populate A(·), and additional levels at which only one of the paths contains a left node, which populate B(·).

Figure 1: An example graph G.
Figure 2: Binary tree representation T of the example graph in Figure 1. The root contains all vertices of G, and the next levels are built by branching each subset of vertices into neighbors and non-neighbors of the chosen vertex. The pair (l, m) next to each node shows the level l that the node belongs to; the position m of the node in that level is determined by counting from left to right.

II-B Transforming a Graph into a Binary Tree

The rooted binary tree T is built by traversing the unlabeled graph G. Each node in T corresponds to a set of vertices from G. The root node corresponds to the set of all vertices, V. To build level l + 1 of the tree, we pick a vertex v; we will specify shortly how it is picked. Vertex v is removed from its set of vertices. The remaining vertices belonging to each node at level l are split into two groups: the left child contains all vertices connected to v, and the right child contains those not connected to v. Empty nodes are not split further. The vertex v is picked randomly from the first nonempty node at level l. Figure 1 and Figure 2 show an example of G and its corresponding rooted binary tree representation T. Since the ordering of vertices is irrelevant for the structure, the random choice does not affect correctness.

Algorithm 1 gives the pseudocode for transforming the graph G into the rooted binary tree T. Note that hereafter, when we traverse a level in the rooted binary tree T, we start from the leftmost node and finish with the rightmost node at that level.

1:function GraphToTree(G)
2:     Create root r_{0,1} of T with all vertices in V
3:     for l ← 0 to n − 1 do
4:         Remove a vertex, v, randomly from the first nonempty node at level l
5:         for each node r in the l-th level of T do
6:              if |r| > 0 then
7:                   left(r) ← vertices in r connected to v in G
8:                   right(r) ← vertices in r not connected to v in G
9:     return T
Algorithm 1 Transform graph G into rooted binary tree T
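The transformation in Algorithm 1 can be sketched in Python. This is a minimal illustration under our own naming (adjacency dicts, `graph_to_tree`), not the paper's implementation; for simplicity it keeps empty children, whereas the algorithm above does not split empty nodes, so a real implementation would prune them.

```python
import random

def graph_to_tree(adj):
    """Transform a graph, given as an adjacency dict (vertex -> set of
    neighbors), into the levels of the rooted binary tree.  Node 2*i of
    the next level is the left child (neighbors of the removed vertex)
    and node 2*i + 1 the right child (non-neighbors) of node i."""
    n = len(adj)
    levels = [[frozenset(adj)]]  # root holds all vertices
    for _ in range(n):
        current = levels[-1]
        # the vertex to encode is drawn from the first nonempty node
        idx = next(i for i, node in enumerate(current) if node)
        v = random.choice(sorted(current[idx]))
        nxt = []
        for i, node in enumerate(current):
            members = node - {v} if i == idx else node
            nxt.append(frozenset(u for u in members if u in adj[v]))      # left child
            nxt.append(frozenset(u for u in members if u not in adj[v]))  # right child
        levels.append(nxt)
    return levels
```

At every level the total cardinality drops by exactly one, since one vertex is removed from the first nonempty node before splitting.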

II-C Encoding the Rooted Binary Tree

As we are only interested in coding the structure of G (i.e., vertices up to an automorphism), we do not need to explicitly encode the sets of vertices corresponding to the nodes in T. It is sufficient to encode only the cardinalities of the nodes; we call T′ the tree with cardinalities as node values (see Figure 3). When we refer to encoding a node of T′, we are referring to encoding the value of the node.

Furthermore, we only need to encode the left nodes, as the values of the siblings (i.e., right nodes) can be deduced from the values of the parent nodes. Consider a nonempty node r at level l; its children satisfy

|left(r)| + |right(r)| = |r| − 1 if r is the first nonempty node at level l, and |left(r)| + |right(r)| = |r| otherwise.

The discrepancy arises from our convention of always removing the vertex to be encoded at each level from the first nonempty node at that level.

Figure 3: Transforming T in Figure 2 into T′. Each node’s value is the cardinality of the corresponding node in T.

It is easy to see that once the decoder has reconstructed the tree T′, and consequently T, one can reconstruct a graph isomorphic to the original graph G. The procedure is very similar to the one for transforming a graph into a binary tree. We start with isolated vertices, their number determined by the cardinality of the root in T′. Then, for each level l, we connect the vertex picked at level l to the vertices in the left nodes at level l + 1 of T′. The resulting graph is isomorphic to the original graph G.

The remaining problem is encoding the values in the tree T′. The better the values can be predicted, the shorter the codelength. The encoder/decoder can predict a value either from global properties of the graph (e.g., the degree distribution) or from local properties, which may depend only on the part of the tree already encoded/decoded. One such property is that a left node’s value can only take integer values between 0 and the value of its parent.

It is important to realize that the properties used for encoding the tree should be properties of the original graph, since these presumably have statistical significance. As an example, if the original graph is modeled as Erdös-Rényi [24] with edge probability p, it is easy to see that the value of a left node is binomially distributed,

P(|left(r)| = j) = C(k, j) p^j (1 − p)^(k − j),   (3)

where k is the cardinality of the parent node (after removal of the encoded vertex, if applicable). The encoder uses the global property p, which must be known by the decoder. In universal coding, this can be transmitted to the decoder initially; Section II-F outlines how to calculate and encode global properties. We call this approach the IID coder, as it is based on the i.i.d. property of the Erdös-Rényi graph. It is roughly equivalent to the approach introduced in [1] for the compression of Erdös-Rényi graphs. The pseudocode for the IID coder is shown in Algorithm 2.

1:function EncodeIID(T′)
2:     Encode n via a positive integer encoder
3:     for l ← 0 to n − 1 do
4:         for each left node r at level l do
5:              Encode |r| with the binomial distribution from (3)
Algorithm 2 Encode T′ with the IID coder
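To make the coder concrete, the following sketch (names ours) computes the ideal IID codelength, i.e., the sum of −log₂ P over the left-node values under the binomial model of (3), rather than running an actual arithmetic coder:

```python
from math import comb, log2

def binomial_pmf(j, k, p):
    """P(value = j) under Binomial(k, p)."""
    return comb(k, j) * p**j * (1 - p) ** (k - j)

def iid_codelength(levels, p):
    """Ideal codelength (bits) for the left-node values of a cardinality
    tree under the IID coder.  `levels` is a list of levels, each a list
    of node cardinalities; the first nonempty node of a level loses one
    vertex (the encoded vertex) before splitting."""
    bits = 0.0
    for cur, nxt in zip(levels, levels[1:]):
        seen_first = False
        for i, parent in enumerate(cur):
            if parent == 0:
                continue  # empty nodes are not split
            k = parent
            if not seen_first:       # encoded vertex removed here
                k -= 1
                seen_first = True
            left = nxt[2 * i]        # left-child cardinality
            bits += -log2(binomial_pmf(left, k, p))
    return bits
```

For the 3-vertex triangle graph with p = 1/2, the left-node values cost 2 + 1 + 0 = 3 bits.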

In practice, Erdös-Rényi graphs are not good models for real-world graphs, as edges are usually not independent [25]. Therefore, we would like to use more advanced graph properties. We classify our coding methods broadly into two classes: 1) Node-by-Node Coders and 2) Level-by-Node Coders. In a Node-by-Node Coder, we still use the binomial distribution; however, the edge probability will depend on local properties (i.e., graph motifs) in G. In a Level-by-Node Coder, we use the degree distribution of G as a global property of the graph to encode the values of all left nodes in a level of T′ at the same time.

The value of a left node in T′ can equivalently be seen as a count of edges of the original graph. Since only the number of edges is encoded, any local property used for encoding must be shared by all edges in a left node. Equivalently, any property of the original graph used for encoding must be convertible to a property of the tree T′. In other words, we convert the original graph into the tree T′, and any properties that we want to use for coding must then become properties purely of T′. We describe this in more detail for each class of coders.

II-D Class 1: Node-by-Node Coder

For Node-by-Node Coders, we traverse back toward the root to determine the existence of certain motif structures in G. This helps us better encode graphs that are not Erdös-Rényi, since in those graphs edges are not independent of one another.

II-D1 Coding of Triangles

The first Node-by-Node Coder we consider is the triangle coder. The triangle coder results in shorter codelengths for graph classes that induce more triangle structures, such as scale-free graphs. A triangle is a cycle graph with three nodes, which is also a 3-clique. Statistics about triangles are often used to characterize graphs [26].

First, we describe how to deduce the existence of triangles in G from the structure of T. We know that the set of vertices corresponding to a left node at level l are connected to the vertex picked at that level. We can also deduce whether there are edges between this set of vertices and a vertex picked at an earlier level l′: if the ancestor of the node at level l′ is a left node, then the vertices corresponding to our left node are also connected to the vertex picked at level l′. To have a triangle, there must additionally be an edge between the two picked vertices themselves; this can be verified by checking whether the ancestor of the first nonempty node at level l′ is a left node. If all of these conditions hold, a triangle forms between the two picked vertices and any vertex in the left node. Such chains of left-node ancestors identify the triangle subgraphs of G in the example of Figure 2. We can use the formation of triangles to encode the nodal values in T′ as follows.

Similar to the IID coder, we use the binomial distribution. However, the triangle coder chooses between two binomial distributions, with parameter p_t or p_0, where p_0 is given by equation (3). We decide between these two distributions with the help of A(r), the set of levels at which the path from r to the root and the path of the first nonempty node both contain left nodes. Consider a left node r: the coder uses p_t to encode its value if A(r) has at least one element, since the vertices corresponding to r and the vertices of the earlier left node are then connected to the same vertex, and therefore triangles form among these vertices in G. When A(r) is empty, we encode with p_0, as no triangle exists among these vertices in G. The pseudocode for the triangle coder is given in Algorithm 3.

1:function EncodeTriangles(T′)
2:     Encode n via a positive integer encoder
3:     for l ← 0 to n − 1 do
4:         for each left node r at level l do
5:              if A(r) ≠ ∅ then
6:                  Encode |r| with parameter p_t
7:              else
8:                  Encode |r| with parameter p_0
Algorithm 3 Encode T′ with triangles from Class 1

II-D2 Coding with the number of common neighbors

In a triangle subgraph, two connected vertices share a single common neighbor. But two vertices may share multiple common neighbors, and there is usually a correlation between the existence of an edge between two vertices and the number of their common neighbors. This property is used for link prediction in complex networks [27, 28]. We can therefore generalize the triangle coder to coding with common neighbors. Instead of using p_t or p_0 to parameterize the binomial distribution, we use p_k, where k denotes the number of common neighbors (p_1 plays the role of p_t in coding with triangles). In other words, we find how many already-encoded vertices are connected to both endpoints of a potential edge and utilize this as a statistic of the graph. The number of common neighbors can range from 0 to the maximum degree of a vertex in G, which we denote by d_max.

We encode the nodal values of T′ with this coder as follows. For a left node r, we look at the cardinality of A(r). It tells us how many already-encoded vertices are connected both to the vertices corresponding to r and to the current vertex. This cardinality determines which parameter p_k to pick for the binomial distribution.
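In graph terms, the driving statistic is just the common-neighbor count. A minimal sketch (helper names are ours, and the parameter table `params` is assumed to be given, e.g., learned as in Section II-F):

```python
def common_neighbors(adj, u, v):
    """Number of common neighbors of u and v, for an adjacency dict
    mapping each vertex to its set of neighbors."""
    return len(adj[u] & adj[v])

def pick_parameter(params, k):
    """Select the edge probability p_k for k observed common neighbors.
    `params` maps k -> p_k; counts beyond the table fall back to the
    largest recorded k."""
    return params.get(k, params[max(params)])
```

In the coder itself, k is read off the tree as |A(r)| rather than from the graph directly, since only the tree is available to the decoder.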

II-D3 Coding over 4-node motifs

We can explore motifs of larger sizes in G to develop new coders; however, computational cost can be a limiting factor. Here, we extend coding to motifs of four vertices. Figure 4 illustrates the motifs we look for. The priority starts with the 4-clique; if a 4-clique does not exist, we look for a double-triangle, and finally a 4-cycle. These motifs are encoded with parameters p_4clique, p_dt, and p_4cycle, respectively. If none of them exists, we encode the left node using p_0.

Figure 4: Motifs of interest over four vertices with three realizations for double-triangle.

As mentioned earlier, we find the appropriate parameter for encoding the nodal values of T′ by looking at the structure of the encoded/decoded part of the tree up to that point. Depending on the cardinality of A(r), different cases are possible:

  • For any pair of levels in A(r), we check whether the first nonempty node of the later level lies in the left subtree of the tree rooted at each node of the earlier level. If such a case exists, we use p_4clique to encode the value of r. Otherwise, configuration 3 of the double-triangle in Figure 4 has occurred; thus, we use p_dt.

  • Two cases are possible:

    • For the first nonempty node at the level in A(r) and the node r, we check whether their paths to the root contain a left node at the same level. If such a node exists, configuration 1 of the double-triangle in Figure 4 has occurred; thus, we use p_dt.

    • For the node r and the first nonempty node at its own level, we check whether their paths contain a left node at the same level. If such a node exists, configuration 2 of the double-triangle in Figure 4 has occurred; thus, we use p_dt.

  • First, we find B(r), which contains all the levels at which either path has left nodes. For any pair of levels in B(r), we look for a case in which the first nonempty node of the later level lies in the left subtree of the tree rooted at any node of the earlier level. If such a case exists, we use p_4cycle to encode the value of r.

  • If none of the above cases has occurred, we use p_0 to encode the value of r.

Due to its similarity to the previous coders, we omit the pseudocode for this coder to avoid repetition.

II-E Class 2: Level-by-Node Coder

One important graph statistic is the degree distribution P(d). Class 2 coders utilize the degree distribution in addition to the graph motifs of Class 1 coders to efficiently encode T′.

For a given level l in T′, Class 1 coders encode the value of each left node (and deduce the value of the right node) independently of one another. Class 2 coders, on the other hand, utilize the degree distribution to encode the values of all left nodes at the same level together.

The number of left nodes encountered on the path from the first nonempty node at level l to the root, which we denote c_l, is equal to the number of edges between the vertex v picked at level l and previously encoded vertices in G. The degree d of vertex v is therefore lower bounded by c_l, and the sum of the values of the left nodes at level l + 1 is d − c_l. We can use this relationship to encode all the left nodes in a level of T′ at the same time.

Encoding with Class 2 entails two steps. First, we send the degree d of the current vertex. Then, we encode the specific values of the left nodes at the next level conditioned on their summing to d − c_l.

For the first step, we use the degree distribution P(d) and the fact that d ≥ c_l. Thus, we encode the degree of the current vertex with

P(d | d ≥ c_l) = P(d) / Σ_{d′ ≥ c_l} P(d′).   (4)
One should notice that at the time of decoding a given level, the decoder knows c_l based on the tree up to that point, and thus it can calculate (4). The decoder can then compute the summation s of the left nodes’ values at the next level as

s = d − c_l.   (5)
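A small sketch of the degree step (names ours): the decoder renormalizes the degree distribution over the feasible range d ≥ c_l, exactly as in (4):

```python
def conditional_degree_pmf(degree_dist, c):
    """P(d | d >= c) for a degree distribution given as a list whose
    entry d is P(d).  Degrees below c get probability zero; the tail is
    renormalized to sum to one."""
    tail = sum(degree_dist[c:])
    return [0.0] * c + [p / tail for p in degree_dist[c:]]
```

Both encoder and decoder can evaluate this from the shared degree distribution and the already-decoded part of the tree.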
To encode the left nodes’ values at a level, we need the joint probability of observing specific values x_1, …, x_M for the M left nodes at that level, conditioned on their summation being equal to s from (5):

P(x_1, …, x_M | x_1 + ⋯ + x_M = s) = P(x_1, …, x_M) / P(x_1 + ⋯ + x_M = s),

where the numerator is the joint probability of observing x_1, …, x_M as the values of the left nodes and the denominator is the probability of all configurations whose left-node values have the same summation s. Under an independence assumption on the values of the left nodes, the joint probability reduces to the product of the individual probabilities.

Given the independence assumption, two cases are possible:

  1. Identically distributed: This case arises when we compute the conditional probability above using the IID coder. The probability of success is then the same for all left nodes, and the problem reduces to a counting problem: conditioned on the sum, the values follow a hypergeometric-type distribution determined by the parent cardinalities, and we encode the left nodes’ values with that distribution.

  2. Not identically distributed: If we use any coder other than the IID coder, different left nodes in the same level do not share the same parameter. For example, with the triangle coder from Class 1, one left node in Figure 3 may be encoded with p_t while another is encoded with p_0. To take this into account, we encode the values of the left nodes using

    P(x_1, …, x_M | Σ_j x_j = s) = Π_j P_j(x_j) / P(Σ_j X_j = s),

    where P_j uses the relevant parameter for each left node in the tree (e.g., p_t or p_0 when using triangles for coding).

    The denominator is a generalized version of the binomial distribution called the Poisson binomial distribution [29]. It is defined as the distribution of a sum of independent trials that are not necessarily identically distributed. Due to the involvement of different probabilities, calculating this denominator can be cumbersome; recursive methods developed in [30, 31] compute it efficiently. We pay this extra computational cost to obtain a more efficient coder, since the entropy of the Poisson binomial distribution is bounded above by the entropy of the binomial distribution with the same mean [32]. The pseudocode for encoding with Class 2 is given in Algorithm 4.
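The denominator can be computed with the standard recursion for the Poisson binomial distribution. The sketch below (ours, not the paper's implementation) sums independent Bernoulli trials; a left node modeled as Binomial(k_j, p_j) simply contributes k_j trials with probability p_j:

```python
def poisson_binomial_pmf(ps):
    """Distribution of the sum of independent Bernoulli trials with
    success probabilities `ps`, via the standard dynamic program.
    Returns a list whose entry s is P(sum = s)."""
    dist = [1.0]  # before any trial, the sum is 0 with probability 1
    for p in ps:
        nxt = [0.0] * (len(dist) + 1)
        for s, q in enumerate(dist):
            nxt[s] += q * (1 - p)      # this trial fails
            nxt[s + 1] += q * p        # this trial succeeds
        dist = nxt
    return dist
```

With all probabilities equal, the recursion reproduces the ordinary binomial distribution, matching case 1 above.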

1:function EncodeClass2(T′)
2:     Encode n via a positive integer encoder
3:     for l ← 0 to n − 1 do
4:         M ← number of left nodes at level l + 1
5:         c_l ← number of left nodes on the path from the first nonempty node at level l to the root
6:         Encode the degree d with (4)
7:         s ← d − c_l  (summation of left nodes’ values at level l + 1)
8:         Compute the conditional distribution of the left nodes’ values given s
9:         Encode the left nodes’ values
Algorithm 4 Encode T′ with Class 2

II-F Calculation and Encoding of Statistics

We consider encoding in two scenarios: learned coding and universal coding. In learned coding, we are given a set of training graphs of a particular class and learn local and global statistics from it; these statistics are then shared by both encoder and decoder. In universal coding, there is no training set and the encoder encodes a single graph; it must also communicate the statistics to the decoder. Below we describe the calculation and communication of statistics for each of these scenarios.

II-F1 Learned Coding

For learned coding, we need to learn statistics from a set of training graphs. Each statistic is calculated by averaging the same statistic over the training set. The edge probability in coding with the IID coder can be estimated from the average degree. Other edge statistics are trickier and should reflect the procedure used for encoding the graph. This means we cannot simply count the number of triangles in a graph to compute p_t: when we encode a graph with triangles, we look for the formation of a triangle in the specific way dictated by the coding algorithm, as described earlier. Thus, we need to transform each graph in the training set into its rooted binary tree representation and collect statistics with the coding algorithm. To estimate p_t and p_0, we traverse each tree the same way we do for coding and divide the tree’s left nodes into those coded with p_t and those coded with p_0. Within each group, we compute the ratio of the summation of left nodes’ values to the summation of left and right nodes’ values. The average over all training trees gives the estimates of p_t and p_0. The same recipe is used to estimate edge statistics for common neighbors and 4-node motifs.

The estimation of the degree distribution in Class 2 is straightforward: it can be estimated through a histogram. We compute the degree distribution for each graph in the training set and estimate the final degree distribution by averaging over them.

II-F2 Universal Coding

For encoding the average degree in the IID coder, we can send the number of edges in the graph; the number of bits required to encode it is small. Once the decoder knows the number of edges, it can compute the parameter of the IID coder with

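As a sketch of this decoder-side computation (the log of the number of possible edge counts stands in for the paper's exact integer code, which is an assumption of ours):

```python
import math

def iid_parameter_codelength(n_nodes, n_edges):
    # Cost (in bits) of telling the decoder the edge count: there are
    # max_edges + 1 possible counts, so log2 of that count is used as a
    # stand-in for the exact integer code (an assumption of this sketch).
    max_edges = n_nodes * (n_nodes - 1) // 2
    bits_for_edge_count = math.log2(max_edges + 1)
    # Once the edge count is known, the IID edge probability follows.
    p = n_edges / max_edges
    return bits_for_edge_count, p
```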
For other local statistics, we use sequential estimation as outlined in [33, Section 13.2]. For example, in coding with triangles, we use sequential estimation of the two coding probabilities, specifically the KT estimator [34, 35], which is

where the counts are the summations of left and right nodes' values previously coded in the rooted binary tree, respectively. One should note that the procedure for updating the probabilities differs between the two classes. For Class 1 coders, the probabilities are updated after coding each node of the rooted binary tree, whereas for Class 2 coders they are updated after coding each level, since all nodes at the same level are encoded together. To estimate statistics for common neighbors and 4-node motifs, we use a similar sequential approach.
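As a sketch, assuming binary outcomes are fed to the estimator one at a time (function names are ours), the KT update and the resulting sequential codelength might look like:

```python
import math

def kt_probability(a, b):
    """Krichevsky-Trofimov estimate of P(next bit = 1)
    after observing a ones and b zeros."""
    return (a + 0.5) / (a + b + 1.0)

def kt_codelength(bits):
    """Codelength (in bits) of a binary sequence when each symbol is
    coded with the current KT probability and the counts are updated
    afterwards, as in sequential universal coding."""
    a = b = 0
    total = 0.0
    for x in bits:
        p1 = kt_probability(a, b)
        total += -math.log2(p1 if x else 1.0 - p1)
        if x:
            a += 1
        else:
            b += 1
    return total
```

For Class 2 coders the counts would be updated only after a whole level is coded, rather than after every symbol as here.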

For the degree distribution, we calculate the degree histogram for the whole graph and use it for coding. The degree of a node is between 0 and n − 1. We can therefore think of the degree histogram as putting each of the n (unlabeled) nodes into one of n buckets, and encoding the histogram amounts to encoding the counts in the buckets. Counting the number of possible configurations is a standard problem in combinatorics, and the configuration can be transmitted with

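One standard way to count such histograms is stars and bars; assuming n buckets for degrees 0 to n − 1 (the paper's exact bucket count may differ), a sketch of the resulting codelength is:

```python
import math

def degree_histogram_bits(n):
    """Bits to index one degree histogram: n unlabeled nodes fall into
    n degree buckets (0 .. n-1); by stars and bars there are
    C(2n - 1, n - 1) possible histograms. The bucket count is an
    assumption of this sketch."""
    return math.log2(math.comb(2 * n - 1, n - 1))
```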
In the experiments that follow, we use the approach described above to calculate and communicate the local and global statistics of graphs.

II-G Experiments

We consider two cases for experiments: 1) using learned coding to encode a graph from the same class as the training set, and 2) using universal coding to encode a single graph. In the former case, we generate synthetic data for different classes of graphs, whereas in the latter case we measure performance on real-world graphs. We evaluate the performance of our coding methods against the IID coder and also compare the performance of Class 1 and Class 2 against each other.

For learned coding, we first need to learn the statistics for the coders. We considered different classes of graphs in our experiments. In all cases, learning was done on 50 graphs. We then used the statistics from training to encode a test graph of the same type. The results are shown in Figure 5. As expected, for Erdös-Rényi graphs, the IID coder is efficient and none of the other coders offer an improvement. However, for Barabási-Albert and Watts-Strogatz graphs, our proposed coders outperform the IID coder by a significant margin. One can observe that coders in Class 2 have shorter codelengths than their counterparts in Class 1. For Barabási-Albert graphs, all encoders in Class 2 have almost the same performance, so the choice between them does not matter. However, for Watts-Strogatz graphs, coding with common neighbors through Class 2 is the most efficient.

Figure 5: Comparison of the codelength associated with each coder for Erdös-Rényi, Barabási-Albert, and Watts-Strogatz graphs. Left plots are for graphs of variable size and fixed average degree. Right plots are for graphs of the same size and a range of average degrees. In the Barabási-Albert graph, the model parameter denotes the number of new edges added by each node. In the Watts-Strogatz graph, the parameters are the number of nearest neighbors in the model and the rewiring probability.

For compression of real-world graphs, we use universal coding since there is no training set. Note that the statistics associated with each coder must be communicated to the decoder as described in Section II-F. We measure the compression performance of Class 1 and Class 2 coders against each other, and also compute the codelength for the IID coder and for arithmetic coding of the labeled graph (denoted labeled iid) for comparison. The following publicly available undirected graphs are considered:

  • Genetic interaction: a gene interaction network [36].

  • Economic Network: the economic network of Victoria, Australia in 1880 [36].

  • Transportation network: a simple graph of the Minnesota road network [36].

  • Collaboration network: extracted from arXiv papers, covering collaborations among authors in the General Relativity and Quantum Cosmology category [37].

  • Facebook-politician: represents the network of blue-verified Facebook pages of politicians [38].

  • Internet (AS level): an undirected graph representing AS peering information inferred from the University of Oregon Route Views Project on January 1, 2000 [39].

As we can see from Table I, coders in Class 2 perform better than their counterparts in Class 1. This improvement comes at the cost of computing (2). Moreover, shorter codelengths are achievable by using Class 1 over the IID coder. Another observation is that coding with common neighbors by means of Class 2 outperforms the other coders in most cases, although in some cases coding with 4-node motifs or triangles offers an improvement over common neighbors. There is no single coder that is best for all graphs; this is no different from the situation for encoding sequences. If one wants the most efficient coder for a particular graph, one can encode with multiple algorithms and choose the shortest, indicating with a header which coder was used, or use soft combining [40]. In the end it is a tradeoff between compression and complexity.

Name                    Nodes  Edges  Labeled iid  IID coder   Class 1:                  Class 2:
                                                               Tri.    ComNei  4-node    Degree  Tri.    ComNei  4-node
Genetic interaction     4227   39484  365599       317379      273444  235011  249467    280375  235787  206485  220191
Economic network        1258   7513   61224        47592       49790   49597   46407     45500   44159   43996   40730
Transportation network  2642   3303   37937        35424       14619   14620   14596     12712   12530   12531   12540
Collaboration network   5242   14496  164224       119038      60375   52041   66203     106215  53647   47320   60182
Facebook-politician     5908   41729  423451       340070      222500  199514  206720    327203  195739  172386  179836
Internet (AS level)     3570   7750   86206        47685       48266   46665   47222     31107   30469   29708   30897

Table I: Compression length (bits) for some real-world graphs. Labeled iid encodes the graph with its labels. Degree denotes the IID coder used together with the degree distribution in Class 2. The best value for each case is boldfaced.

III Model Selection in Gaussian Graphical Models

In this section, we introduce an application of graph coding for model selection in Gaussian graphical models. The goal is to find a graph that accurately describes dependencies among multivariate Gaussian variables [41]. Multivariate Gaussian distributions are widely used in modeling real-world problems where the relationship among approximately normally distributed variables is of interest. They have been used to model datasets such as stock returns [42, 43], protein-protein interactions [44], and brain functional activities [45].

Let X be a p-dimensional random vector with multivariate Gaussian distribution N(μ, Σ). The inverse of the covariance matrix, K = Σ⁻¹, is known as the precision matrix. If the (i, j) entry of the precision matrix (i.e., K_ij) is 0, then X_i and X_j are conditionally independent given all the other variables. Therefore, we can visualize the conditional independence relationships between any pair of variables as an unweighted, undirected graph G whose adjacency matrix A is such that

A_ij = 1 if K_ij ≠ 0 and i ≠ j, and A_ij = 0 otherwise.   (9)

The graph G is known as the conditional independence graph. Its structure is determined by the sparsity pattern of K.
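The mapping from precision matrix to conditional independence graph can be sketched as follows (the numerical tolerance for "nonzero" is our addition):

```python
import numpy as np

def conditional_independence_graph(K, tol=1e-8):
    """Adjacency matrix of the conditional independence graph:
    an edge (i, j) iff the precision-matrix entry K[i, j] is nonzero
    and i != j. The tolerance handles numerical zeros."""
    A = (np.abs(K) > tol).astype(int)
    np.fill_diagonal(A, 0)
    return A
```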

Often, we are interested in the problem of estimating G from n i.i.d. observations. To do that, we need to estimate K first. The problem is especially difficult when n < p, in which case the maximum likelihood estimate does not exist. In most approaches, the estimator of K contains a regularization term to 1) prevent overfitting and 2) control the sparsity of the estimated solution. A popular sparse estimator of the precision matrix is found by maximizing the penalized log-likelihood [41, 16, 46]

K̂ = argmax_{K ≻ 0} { log det K − tr(SK) − λ‖K‖₁ },   (10)

where S is the sample covariance matrix, and λ is the regularization parameter that controls sparsity. The performance of the estimator in (10) is highly influenced by the selection of λ: depending on its value, the conditional independence graph can range from a dense graph for small λ to a graph with no edges for large λ. Model selection techniques are used to find the best value of λ.
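For intuition, the penalized log-likelihood objective can be evaluated directly. In this sketch the L1 penalty is applied only to off-diagonal entries, a convention that varies between implementations:

```python
import numpy as np

def penalized_loglik(K, S, lam):
    """Graphical-lasso objective (up to constants): log det K - tr(SK)
    minus an L1 penalty on the off-diagonal entries of K. Penalizing
    only off-diagonals is an assumption of this sketch."""
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        return -np.inf  # K must be positive definite
    off_l1 = np.abs(K).sum() - np.abs(np.diag(K)).sum()
    return logdet - np.trace(S @ K) - lam * off_l1
```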

We use the MDL principle and select the λ that minimizes the sum of the description length of the conditional independence graph structure and of the data when encoded under that graph structure:

λ* = argmin_λ { L(G) + L(y | G) },   (11)

where G is the conditional independence graph estimated at λ, L(G) is the description length of G, and L(y | G) is the description length of the data y when encoded with G.

To compute the description length of the conditional independence graph, L(G), we first need to find the underlying conditional independence graph G. By applying an estimator to the sequence of n i.i.d. observations from the p-variate Gaussian distribution, we obtain an estimated precision matrix K̂ for each realization of λ. We drop the explicit dependence on λ from G and the related precision and covariance matrices for notational simplicity. In this paper, we use graphical lasso [41], one of the most commonly used solvers of the penalized log-likelihood in Gaussian graphical models; our approach can equally well be applied to any other estimator. The conditional independence graph G is obtained from the estimated precision matrix K̂. We then pick a graph coder described in Section II to compute the description length of G.

To compute the description length of the data, L(y | G), we face two challenges. The first is that lossless source coding does not generalize directly to real-valued data. Second, we have to encode the data based on the underlying conditional independence graph G. To encode real-valued data, we assume a fixed-point representation with a (large) finite number, b, of bits after the period and an unlimited number of bits before the period [3]. The number of bits required to represent a sample y according to the probability density function (pdf) f is then approximately

L(y) = −log₂ f(y) + b.   (12)

Since we are only interested in relative codelengths between different models, the dependency on b cancels out.

The second challenge is how to encode the data with respect to G. As mentioned, the data is generated by a multivariate Gaussian distribution, which can be characterized by its covariance matrix Σ. Since we do not have access to the true Σ, we find an estimate Σ̂ based on the sample covariance matrix S. The structure of Σ̂ should match the structure of G obtained in the previous step:

Σ̂_ij = S_ij if A_ij = 1 or i = j,   (Σ̂⁻¹)_ij = 0 if A_ij = 0 and i ≠ j.   (13)

This is known as the matrix completion problem [47]. Dempster [48] proved the existence and uniqueness of the maximum likelihood solution for Σ̂ when the sample covariance matrix is positive definite, and presented a coordinate-descent algorithm to find the solution iteratively. It should be noted that the precision matrix obtained from the graphical lasso solution is not necessarily equal to the inverse of the estimated covariance matrix Σ̂.

Once we estimate Σ̂, we use predictive MDL [5] to compute

L(y | G) = −Σ_i log₂ f(y_i | Σ̂^(i−1)),   (14)

where f is the conditional pdf and Σ̂^(i−1) denotes the maximum likelihood estimate of the parameters, which in this case is the covariance estimated under G from the first i − 1 samples. The codelength in (14) is a valid codelength since it is sequentially decodable. Note that it does not work for the first few samples, for which no estimate of Σ̂ exists; instead, we encode the first few samples with a default distribution, which is the same across different realizations of λ.
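A minimal sketch of the predictive MDL sum in (14), assuming zero-mean Gaussians. The covariance estimator is abstracted as a callback, and the first few samples (which the paper encodes with a default distribution) are simply skipped here:

```python
import numpy as np

def predictive_mdl_codelength(Y, cov_fn, n_init=5):
    """Sample i is encoded with the zero-mean Gaussian whose covariance
    cov_fn estimates from samples 0..i-1 (standing in for the estimate
    under the graph G). Skipping the first n_init samples instead of
    using a default code is a simplification of this sketch."""
    n, p = Y.shape
    ln2 = np.log(2.0)
    total = 0.0
    for i in range(n_init, n):
        cov = cov_fn(Y[:i])
        sign, logdet = np.linalg.slogdet(cov)
        quad = Y[i] @ np.linalg.solve(cov, Y[i])
        # -log2 of the N(0, cov) density evaluated at Y[i]
        total += 0.5 * (p * np.log2(2 * np.pi) + logdet / ln2 + quad / ln2)
    return total
```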

Finally, the best conditional independence graph structure is obtained by minimizing the total description length over λ. In the following, we present an algorithm to find the best graph model of the data.

Input: Samples y and regularization parameters λ.
      Output: Best graph model of the data.

1: for each realization of λ do
2:     Apply the graphical model estimator to find K̂ and build G based on (9).
3:     Use any graph coder in Section II to compute L(G).
4:     Compute L(y | G) via (14), where Σ̂ denotes the covariance estimate obtained from (13).
5:     Add up the codelengths resulting from steps 3 and 4.
6: return the λ associated with the shortest total codelength in step 5.
Algorithm 5 Find the best graph model of data via graph coding
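A minimal sketch of the selection loop in Algorithm 5, with the estimator and the two coders abstracted as callbacks (all names are ours):

```python
def select_lambda(lambdas, fit, graph_bits, data_bits):
    """For each lambda, estimate the graph and add the description
    lengths of the graph and of the data under it; return the lambda
    with the shortest total codelength. fit, graph_bits and data_bits
    stand in for the estimator and the coders of Section II / eq. (14)."""
    best_lam, best_total = None, float("inf")
    for lam in lambdas:
        G = fit(lam)
        total = graph_bits(G) + data_bits(G)
        if total < best_total:
            best_lam, best_total = lam, total
    return best_lam
```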

III-A Experiments

We tested our approach on both synthetic and real-world data. For synthetic data, we consider different conditional independence graph structures, and the goal is to recover the true conditional independence graph. For real-world data, we applied our approach to ECG data from a group of healthy subjects and a group of subjects with Kawasaki disease, to determine whether there is any difference between the conditional independence graphs of the two groups.

We now provide simulation results on synthetic data generated from a zero-mean multivariate Gaussian distribution with known precision matrix K. We compare the performance of graph coding methods against other methods in the recovery of the conditional independence graph, using the F1-score, a widely used metric in the literature. The F1-score is the harmonic mean of precision and recall, where precision is the ratio of the number of correctly estimated edges to the total number of edges in the estimated conditional independence graph, and recall is the ratio of the number of correctly estimated edges to the total number of edges in the true conditional independence graph [23]. We consider two cases: 1) the number of observations is larger than the number of variables, n > p, and 2) the number of observations is smaller than the number of variables, n < p. The experiments were repeated 50 times. We applied the graphical lasso estimator described in [41] over a suitable range of λ for each case, chosen to cover a wide range of structures from dense graphs to totally isolated nodes.
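The F1 computation just described can be sketched for adjacency matrices as follows (function name is ours):

```python
import numpy as np

def edge_f1(A_est, A_true):
    """F1-score of edge recovery: precision and recall are computed
    over the upper triangle of the two adjacency matrices, so each
    undirected edge is counted once."""
    iu = np.triu_indices(A_true.shape[0], k=1)
    est = A_est[iu].astype(bool)
    true = A_true[iu].astype(bool)
    tp = int(np.sum(est & true))  # correctly estimated edges
    if tp == 0:
        return 0.0
    precision = tp / est.sum()
    recall = tp / true.sum()
    return 2 * precision * recall / (precision + recall)
```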

Type   p    n    Optimum  CV    BIC   EBIC  IID   Class 1:              Class 2:
                                                  Tri.  ComNei 4-node  Degree Tri.  ComNei 4-node
Cycle  100  200  1        0.26  0.64  0.75  1     0.99  0.99   1       1      1     1      1
Cycle  200  400  1        0.23  0.61  0.69  1     1     1      1       1      1     1      1
AR(1)  100  200  0.99     0.26  0.65  0.75  0.65  0.99  0.99   0.99    0.99   0.99  0.99   0.99
AR(1)  200  400  1        0.23  0.62  0.70  0.99  0.99  0.99   1       1      1     1      1
ER     100  200  0.71     0.28  0.63  0.40  0.68  0.69  0.70   0.70    0.70   0.70  0.70   0.70
ER     200  400  0.80     0.27  0.68  0.75  0.79  0.78  0.78   0.79    0.79   0.79  0.79   0.79
Hub    100  200  0.50     0.23  0.48  0.01  0.47  0.48  0.48   0.48    0.49   0.49  0.49   0.49
Hub    200  400  0.44     0.22  0.43  0.16  0.43  0.43  0.43   0.43    0.44   0.44  0.44   0.44
Table II: F1-score of conditional independence graph recovery using different model selection methods (CV, BIC, and EBIC are the benchmark methods). Results are the average over 50 replications when n > p.
Type   p    n    Optimum  CV    BIC   EBIC  IID   Class 1:              Class 2:
                                                  Tri.  ComNei 4-node  Degree Tri.  ComNei 4-node
Cycle  100  50   0.89     0.28  0.26  0.74  0.81  0.89  0.89   0.89    0.89   0.89  0.89   0.89
Cycle  200  100  0.97     0.23  0.29  0.69  0.76  0.96  0.96   0.96    0.97   0.96  0.96   0.96
AR(1)  100  50   0.89     0.27  0.25  0.74  0.81  0.89  0.89   0.89    0.89   0.89  0.89   0.89
AR(1)  200  100  0.97     0.23  0.28  0.69  0.73  0.95  0.95   0.97    0.97   0.96  0.96   0.97
ER     100  50   0.54     0.28  0.26  0.40  0.47  0.49  0.49   0.49    0.36   0.46  0.46   0.45
ER     200  100  0.66     0.29  0.34  0.63  0.63  0.64  0.64   0.64    0.58   0.63  0.63   0.63
Hub    100  50   0.32     0.18  0.17  0.02  0.25  0.15  0.15   0.13    0.22   0.25  0.24   0.24
Hub    200  100  0.31     0.15  0.20  0.03  0.25  0.13  0.13   0.13    0.26   0.27  0.27   0.27
Table III: F1-score of conditional independence graph recovery using different model selection methods (CV, BIC, and EBIC are the benchmark methods). Results are the average over 50 replications when n < p.

The smaller range of λ is used because the graphical lasso estimator does not converge outside it in our test cases. The values of λ are equally spaced. Multivariate Gaussian data was generated with different precision matrix structures that have been frequently used as test cases in the literature [49, 50]:

  • Cycle structure.

  • Autoregressive process of order one, AR(1).

  • Erdös-Rényi (ER) structure, K = B + δI, where B is a matrix with off-diagonal elements taking values randomly chosen from a uniform distribution with a given probability and diagonal values set to zero. To keep K positive definite, δ is chosen as the absolute value of the minimum eigenvalue of B plus a small positive constant, and I is the identity matrix of size p.

  • Hub structure with two hubs. First, we create an adjacency matrix by setting off-diagonal elements to one with a given probability and zero otherwise. Next, we randomly select two hub nodes and set the elements of the corresponding rows and columns to 1 with a given probability and zero otherwise. After that, each nonzero element is assigned a value chosen randomly from a uniform distribution, and the matrix is symmetrized. The final precision matrix is obtained with the same positive-definiteness set-up as the Erdös-Rényi structure.
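A sketch of generating such a test precision matrix for the ER structure. The edge probability, the weight range, and the diagonal margin are assumptions of ours, not values from the paper:

```python
import numpy as np

def er_precision(p, prob=0.1, seed=0):
    """Erdos-Renyi-style test precision matrix: symmetric random
    off-diagonal weights kept with probability prob, then a diagonal
    shift by the absolute minimum eigenvalue plus a margin so the
    result is positive definite."""
    rng = np.random.default_rng(seed)
    B = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    keep = rng.random(iu[0].size) < prob          # which edges exist
    B[iu] = rng.uniform(0.5, 1.0, size=iu[0].size) * keep
    B = B + B.T                                   # symmetrize
    shift = abs(np.linalg.eigvalsh(B).min()) + 0.1
    return B + shift * np.eye(p)
```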

Table II and Table III show the results of applying different methods to recover the conditional independence graph using graphical lasso as the estimator, for the n > p and n < p cases, respectively. To have a ground truth for comparison, we provide the highest possible F1-score on the regularization path in the column named Optimum. For benchmark methods, we considered CV, BIC, and EBIC. The results for CV are given for 5-fold CV, and those for EBIC are given for the recommended value of its tuning parameter [18]. The results for the graph-coding methods are obtained by following the steps outlined in Algorithm 5. The values are the average of 50 Monte Carlo trials.

It can be seen that the graph-coding techniques, i.e., the columns named IID, Class 1, and Class 2, outperform the other methods in most cases. Furthermore, the F1-score of the conditional independence graph recovered by these methods is close to the optimum value on the regularization path. Among graph coding methods, coding with Class 1 and Class 2 coders gives better performance than coding with the IID coder in most cases, and coders in Class 2 often offer a slight improvement over their Class 1 counterparts. These findings confirm the value of more advanced graph coders that reflect more information about the graph in their codelength.

Figure 6: Conditional independence graph of (a) healthy subjects, (b) Kawasaki subjects.

We also tested our approach on a real-world dataset containing features extracted from 12-lead ECG signals of a group of healthy subjects and a group of subjects with Kawasaki disease, all of age one to three years. This dataset was initially processed to select features with empirical distributions close to normal, since the underlying assumption in Gaussian graphical models is that the data is normally distributed; this is also a key assumption for encoding data under our approach. After screening all features and selecting those with approximately normal distributions, we applied our approach to find the graph model of the data for each group of subjects over equally spaced values of λ. The results, given in Figure 6, show that the graph model of healthy subjects differs from that of subjects with Kawasaki disease. Features whose names end in a lead number vary across leads; the rest of the features are the same across all leads. The conditional independence graph given by our approach for each group of subjects is the same across the different graph coders. As can be seen, the conditional independence graph of healthy subjects is sparser, with fewer edges than the graph of Kawasaki subjects; moreover, the two graphs share a number of edges.

IV Conclusion

In this paper we developed universal coders based on graph statistics for the compression of unlabeled graphs. We achieved this by first transforming the graph structure into a rooted binary tree and showing that compression of the graph structure is equivalent to compression of its associated rooted binary tree. We then used graph statistics to develop two main classes of graph coders: the first class uses graph motifs as local properties, and the second additionally utilizes the degree distribution as a global statistic. Finally, we introduced an application of graph coding to model selection in Gaussian graphical models.

For future work, we will extend this work in two directions. The first is to extend graph compression to undirected graphs with attributes, where the graph topology will be utilized for the compression of both structure and attributes. The second is to extend graph coding to directed graphs and to use the developed coders for data analysis. Similar to undirected graphs, the idea is to transform the directed graph into a rooted tree and encode nodal values on the rooted tree.


  • [1] Yongwook Choi and Wojciech Szpankowski. Compression of graphical structures: Fundamental limits, algorithms, and experiments. IEEE Transactions on Information Theory, 58(2):620–638, 2012.
  • [2] C. Steinruecken. Compressing sets and multisets of sequences. IEEE Transactions on Information Theory, 61(3):1485–1490, March 2015.
  • [3] Jorma Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of statistics, 11(2):416–431, 1983.
  • [4] Jorma Rissanen. Modeling by shortest data description. Automatica, pages 465–471, 1978.
  • [5] Jorma Rissanen. Stochastic complexity and modeling. The annals of statistics, pages 1080–1100, 1986.
  • [6] Tomasz Luczak, Abram Magner, and Wojciech Szpankowski. Structural information and compression of scale-free graphs. on internet, 2017.
  • [7] T. Luczak, A. Magner, and W. Szpankowski. Asymmetry and structural information in preferential attachment graphs. ArXiv e-prints 1607.04102, July 2016.
  • [8] Amir R. Asadi, Emmanuel Abbe, and Sergio Verdu. Compressing data on graphs with clusters. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1583–1587, June 2017.
  • [9] Payam Delgosha and Venkat Anantharam. Universal lossless compression of graphical data. IEEE Transactions on Information Theory, 66(11):6962–6976, 2020.
  • [10] Anders Host-Madsen and June Zhang. Coding of graphs with application to graph anomaly detection. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1829–1833. IEEE, 2018.
  • [11] Tomasz Łuczak, Abram Magner, and Wojciech Szpankowski. Compression of preferential attachment graphs. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1697–1701. IEEE, 2019.
  • [12] Krzysztof Turowski, Abram Magner, and Wojciech Szpankowski. Compression of dynamic graphs generated by a duplication model. Algorithmica, 82(9):2687–2707, 2020.
  • [13] Sourya Basu and Lav R Varshney. Universal and succinct source coding of deep neural networks. arXiv preprint arXiv:1804.02800, 2018.
  • [14] Payam Delgosha and Venkat Anantharam. A universal low complexity compression algorithm for sparse marked graphs. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2349–2354. IEEE, 2020.
  • [15] Maciej Besta and Torsten Hoefler. Survey and taxonomy of lossless graph compression and space-efficient graph representations. arXiv preprint arXiv:1806.01799, 2018.
  • [16] Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  • [17] Patricia Menéndez, Yiannis AI Kourmpetis, Cajo JF ter Braak, and Fred A van Eeuwijk. Gene regulatory networks from multifactorial perturbations using graphical lasso: application to the dream4 challenge. PloS one, 5(12):e14147, 2010.
  • [18] Rina Foygel and Mathias Drton. Extended bayesian information criteria for gaussian graphical models. In Advances in neural information processing systems, pages 604–612, 2010.
  • [19] Mojtaba Abolfazli, Anders Host-Madsen, June Zhang, and Andras Bratincsak. Graph coding for model selection and anomaly detection in gaussian graphical models. arXiv preprint arXiv:2102.02431, 2021.
  • [20] Adam J Rothman, Peter J Bickel, Elizaveta Levina, Ji Zhu, et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
  • [21] Heng Lian. Shrinkage tuning parameter selection in precision matrices estimation. Journal of Statistical Planning and Inference, 141(8):2839–2848, 2011.
  • [22] Shaun Lysen. Permuted inclusion criterion: a variable selection technique. Publicly accessible Penn Dissertations, page 28, 2009.
  • [23] Han Liu, Kathryn Roeder, and Larry Wasserman. Stability approach to regularization selection (stars) for high dimensional graphical models. In Advances in neural information processing systems, pages 1432–1440, 2010.
  • [24] Béla Bollobás. Random Graphs. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2 edition, 2001.
  • [25] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.
  • [26] Albert-László Barabási. Network science. Cambridge university press, 2016.
  • [27] Lin Yao, Luning Wang, Lv Pan, and Kai Yao. Link prediction based on common-neighbors for dynamic social network. Procedia Computer Science, 83:82–89, 2016.
  • [28] Shibao Li, Junwei Huang, Zhigang Zhang, Jianhang Liu, Tingpei Huang, and Haihua Chen. Similarity-based future common neighbors model for link prediction in complex networks. Scientific reports, 8(1):1–11, 2018.
  • [29] Yuan H Wang. On the number of successes in independent trials. Statistica Sinica, pages 295–312, 1993.
  • [30] Xiang-Hui Chen, Arthur P Dempster, and Jun S Liu. Weighted finite population sampling to maximize entropy. Biometrika, 81(3):457–469, 1994.
  • [31] R. E. Barlow and K. D. Heidtmann. Computing k-out-of-n system reliability. IEEE Transactions on Reliability, R-33(4):322–323, 1984.
  • [32] Peter Harremoës. Binomial and poisson distributions as maximum entropy distributions. IEEE Transactions on Information Theory, 47(5):2039–2041, 2001.
  • [33] T.M. Cover and J.A. Thomas. Information Theory, 2nd Edition. John Wiley, 2006.
  • [34] R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, Mar 1981.
  • [35] F. M J Willems, Y.M. Shtarkov, and T.J. Tjalkens. The context-tree weighting method: basic properties. Information Theory, IEEE Transactions on, 41(3):653–664, 1995.
  • [36] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
  • [37] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD), 1(1):2–es, 2007.
  • [38] Benedek Rozemberczki, Ryan Davies, Rik Sarkar, and Charles Sutton. Gemsec: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2019, pages 65–72. ACM, 2019.
  • [39] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, 2005.
  • [40] P. A J Volf and F. M J Willems. Switching between two universal source coding algorithms. In Data Compression Conference, 1998. DCC ’98. Proceedings, pages 491–500, 1998.
  • [41] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • [42] Stanley J Kon. Models of stock returns-a comparison. The Journal of Finance, 39(1):147–165, 1984.
  • [43] Vasyl Golosnoy, Bastian Gribisch, and Roman Liesenfeld. The conditional autoregressive wishart model for multivariate stock market volatility. Journal of Econometrics, 167(1):211–223, 2012.
  • [44] Carlo Baldassi, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina, Martin Weigt, and Andrea Pagnani. Fast and accurate multivariate gaussian modeling of protein families: predicting residue contacts and protein-interaction partners. PloS one, 9(3):e92721, 2014.
  • [45] Gaël Varoquaux, Alexandre Gramfort, Jean-Baptiste Poline, and Bertrand Thirion. Brain covariance selection: better individual functional connectivity models using population prior. Advances in neural information processing systems, 23:2334–2342, 2010.
  • [46] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516, 2008.
  • [47] Robert Grone, Charles R Johnson, Eduardo M Sá, and Henry Wolkowicz. Positive definite completions of partial hermitian matrices. Linear algebra and its applications, 58:109–124, 1984.
  • [48] Arthur P Dempster. Covariance selection. Biometrics, pages 157–175, 1972.
  • [49] Lu Li and Kim-Chuan Toh. An inexact interior point method for l 1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3-4):291–315, 2010.
  • [50] Kean Ming Tan, Palma London, Karthik Mohan, Su-In Lee, Maryam Fazel, and Daniela Witten. Learning graphical models with hubs. arXiv preprint arXiv:1402.7349, 2014.