I. Introduction
Graphs can be used to represent complex dependencies between many variables. Graph-based analysis is used in many disciplines, such as social network analysis, natural language processing, chemistry, and bioinformatics. Real-world graphs can be large and expensive to store and communicate, so developing efficient methods to compress graphical data is of interest for storage and transmission. Classic information-theoretic methods for coding deal with sequential data. Unlike sequential data, which has starting and stopping points (an ordering), graphs are unordered structures. Since ordering is essential in coding, graph coding is challenging: graphs cannot be efficiently compressed using traditional, sequence-based compression methods.
In this paper, we first present new universal source coding methods for the lossless compression of unlabeled, unweighted, undirected, simple graphs. Since the graphs are unlabeled, vertices do not have distinct identifications except through their interconnections. Reference [1] referred to such a graph as a graph structure. Our approach has two steps. First, inspired by Steinruecken's method for coding of unordered i.i.d. sequences [2], we transform a graph, G, into an equivalent rooted binary tree, T. Then, we use graph statistics from G to develop two classes of coders for T. The first class utilizes local properties (i.e., formation of graph motifs) such as triangles, and the second class uses the degree distribution, a global graph statistic, along with local properties for encoding. This step uses past information (encoded vertices) to encode the connections at the current vertex. This way we can build a probabilistic model based on graph statistics (either by learning or estimating statistics) to encode graphs. Our coders reflect more information about the structure of G
and therefore give shorter codelengths compared to the structural coding of [1] for graphs with more structure than Erdős-Rényi graphs. In the second half of the paper, we use graph coding for data analysis via description length, the number of bits needed to describe the data. Rissanen proposed using description length for model selection in his work on the minimum description length (MDL) principle [3, 4, 5]. Rissanen's MDL principle codifies model selection, balancing how well a model describes the given data against the complexity of the model. Here, we develop a novel approach for model selection of Gaussian graphical models. We compute the description length of the conditional independence graph (the sparsity pattern of the precision matrix) using our lossless graph coding methods, and of the data under the assumed Gaussian distribution. The model that minimizes the sum of these two terms is selected as the best model. Unlike other methods that may only consider the number of edges in the conditional independence graph to account for model complexity, our approach considers the whole structure of the conditional independence graph through lossless compression.
We show, using synthetic and real-world data, the advantages of our methods for both compression and Gaussian graphical model selection. Our approach outperforms competing source coding methods in two different scenarios: 1) compression of a single real-world graph, and 2) compression of a graph of a particular type after learning the statistics of its type through training. We also compare our approach with common methods in the literature for model selection in Gaussian graphical models. Experimental results on synthetic data show that our approach can recover the true graph model of the data with a higher F1-score. We also consider a real-world dataset containing features extracted from 12-lead electrocardiogram (ECG) signals of a group of healthy people and a group of people with Kawasaki disease. We observe that there is a difference between the graph model of healthy people and that of people with Kawasaki disease.
The paper is organized as follows. First, we provide an overview of previous work on universal compression methods for graphs in Section I-A. In Section II, we describe our approach for transforming a graph structure into a rooted binary tree and show that the compression of the rooted binary tree is equivalent to the compression of the graph structure. Then, we introduce two classes of universal coders based on graph statistics and provide experimental results on synthetic and real-world graphs. In Section III, we present the application of graph coding to model selection in Gaussian graphical models. We also give performance results on synthetic data and then apply our approach to ECG data. We make concluding remarks and give some directions for future work in Section IV.
I-A. Prior Work
Graph compression is a relatively new area in source coding. References [6, 7, 8, 9] focused on the entropy analysis of graph compression. Other papers have provided practical graph compression algorithms. These algorithms can be designed for the compression of either unlabeled graphs (i.e., graph structures) or labeled graphs (i.e., encoding vertex labels together with the graph structure). In the case of an unlabeled graph, the decoder recovers a graph that is isomorphic to the input graph. In the case of a labeled graph, the decoder recovers the exact graph (i.e., the graph structure with labels) at the expense of longer codewords. In other words, encoding of unlabeled graphs benefits from isomorphism, since there are different labeled graphs that share the same structure. Fewer bits are required, in general, to encode an unlabeled graph compared to a labeled graph with the same structure [1].
Reference [1] was the first study to develop a compression algorithm that reaches the entropy of the distribution on unlabeled Erdős-Rényi graphs up to the first two terms. In [10], two universal coding methods for arbitrary unlabeled graphs, based on the degree distribution and the formation of triangles, were introduced. The authors in [11] presented asymptotically optimal structural compression algorithms for the compression of both unlabeled and labeled preferential attachment graphs. The compression of dynamic graphs generated by the duplication model was studied in [12], where the authors developed algorithms for the compression of both unlabeled and labeled versions of such graphs. Reference [13] introduced an algorithm for the compression of deep feedforward neural networks by modeling them as bipartite graph layers. An algorithm for the compression of sparse labeled graphs with marks on vertices and edges was introduced in [14]. A general survey on lossless graph compression methods can be found in [15]. Previous compression methods are tailored to specific graph models and can perform poorly on other graph models or real-world graphs. Our approach provides a method to encode arbitrary unlabeled graph structures by building a probabilistic model based on graph properties (e.g., graph motifs and the degree distribution). This general approach extracts more information from the graph structure than prior work and results in shorter codelengths compared to other source coding methods.
The literature on model selection methods for Gaussian graphical models is rich. Existing methods broadly fall into two main classes: information-based methods and resampling methods. Information-based methods, such as the Bayesian information criterion (BIC) [16], the Akaike information criterion (AIC) [17], and the extended Bayesian information criterion (EBIC) [18], select the best model based on the log-likelihood of the data and the complexity of the Gaussian graphical model. To account for model complexity, these methods typically consider the number of edges of the conditional independence graph induced by the Gaussian graphical model; while easy to obtain, this statistic gives only a rough approximation of the structure of the graph. In contrast, our approach for model selection is to choose the best model based on the description length of the conditional independence graph and of the data when encoded with the resulting conditional independence graph [19].
Resampling methods measure the performance on out-of-sample data by splitting the data into a subset of samples for fitting and using the remaining samples to estimate the efficacy of the model. The most common method in this class is cross-validation (CV) [20]. Other methods are Generalized Approximate Cross-Validation (GACV) [21], the Rotation Information Criterion (RIC) [22], and the Stability Approach to Regularization Selection (StARS) [23]. The major shortcoming of resampling methods is their high computational cost, since they require solving the problem across all sets of subsamples.
A key aspect of model selection in Gaussian graphical models is the structure of the conditional independence graph. Resampling methods overlook this aspect, while information-based methods only consider simple graph statistics such as the number of edges of the conditional independence graph. Our approach offers a different perspective that accounts for model complexity more accurately. It relies on being able to compute the description length of the conditional independence graph using our lossless graph compression methods.
II. Graph Compression
Consider an unweighted, undirected, simple, unlabeled graph G = (V, E), where V is the set of vertices and E is the set of edges. In this paper, we present a method to compress a graph structure inspired by Steinruecken's method for coding unordered i.i.d. binary sequences [2]. We use the terms compression and coding interchangeably in this paper, as they denote the same process. One can see our methods as an extension of [1] that takes into account more graph structure. The general idea is to randomly pick a vertex, v, and encode information about all the incident edges of v. A neighbor of v is then picked, and the information about its incident edges is encoded. To implement this scheme, the unlabeled graph G is transformed into a rooted binary tree T. Each level of T is associated with the vertex of G being encoded. An encoder is designed to encode T. A decoder decodes T without error and recovers a graph identical to G up to an automorphism of the nodes. In this paper, we introduce two broad classes of graph coders to efficiently encode the rooted binary tree T.
II-A. Definitions and Notation
Let v_i denote the ith vertex in G. We will exclusively use vertices to refer to the nodes in G and nodes to refer to the nodes in T. The total number of vertices in G is n. Let e_ij denote an edge between vertices v_i and v_j in G. The degree of vertex v_i is the total number of vertices connected to v_i. The degree distribution P(d) is the probability distribution of the degrees of vertices in G and is an often used statistic to differentiate between different classes of random graphs. The adjacency matrix of G is denoted by A, an n x n matrix where A_ij = 1 if there is an edge between v_i and v_j in G. The structure of the rooted binary tree T is very important in the encoding process. We will divide the organization of T into different levels (also known as depths). Let t_{k,j} denote the jth node of the kth level of T. Note that the root node is t_{0,1}. Figure 2 shows our naming convention for all the other nodes. A node in T can contain multiple vertices from G. The cardinality of node t_{k,j} is denoted by |t_{k,j}|. By default, |t_{0,1}| = n.
Definition 1.
A node t_{k,j} is a left node if it is the left child of a node in level k - 1. In our convention, left nodes have an odd index value j.

Definition 2.
A node t_{k,j} is a right node if it is the right child of a node in level k - 1. In our convention, right nodes have an even index value j.
Definition 3.
For a level k, the first nonempty node refers to node t_{k,j*} with

j* = min{ j : |t_{k,j}| > 0 }.     (1)
Oftentimes, we may wish to explicitly refer to the parent or children of node t_{k,j}. We use pa(t_{k,j}) to refer to the parent node of t_{k,j}, and left(t_{k,j}) and right(t_{k,j}) to refer to the left and right children of t_{k,j}. The example tree shown in Figure 2 illustrates these relations.
Let path(t_{k,j}) denote the path from node t_{k,j} to the root node. We find the levels at which path(t_{k,j}) and path(t_{k,j*}) both contain left nodes and store those levels in one set, and we find the levels at which path(t_{k,j}) or path(t_{k,j*}) contains a left node and store those levels in a second set. Figure 2 illustrates both sets on an example: a level at which the two paths both pass through left nodes belongs to the first set, while a level at which only one of the paths goes left belongs to the second set only.
II-B. Transforming a Graph into a Binary Tree
The rooted binary tree T is built by traversing the unlabeled graph G. Each node in T corresponds to a set of vertices from G. The root node corresponds to the set of all the vertices, V. To build level k + 1 of the tree, we pick a vertex v_k; we will specify shortly how it is picked. Vertex v_k is removed from the set of vertices. The remaining vertices belonging to each node at level k are split into two groups: the left node contains all vertices connected to v_k, and the right node contains those not connected to v_k. Empty nodes are not split further. It is worth noting that the vertex v_k is picked randomly from the first nonempty node at level k, i.e., node t_{k,j*}. Figure 1 and Figure 2 show an example of a graph G and its corresponding rooted binary tree representation T, respectively. Since the order of the nodes is irrelevant for the structure, here we have fixed one particular choice of the removed vertices.
Algorithm 1 gives the pseudocode for transforming the graph G into the rooted binary tree T. Note that hereafter, when we traverse a level in the rooted binary tree T, we start from the leftmost node and finish with the rightmost node at that level.
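To make the construction concrete, the following Python sketch builds the level structure of the tree from an adjacency-set representation. The function name, the list-of-lists representation, the deterministic choice of the removed vertex (the paper picks it randomly), and the decision to drop empty children entirely are our own simplifications, not the paper's Algorithm 1.

```python
def graph_to_tree_levels(adj, n):
    """Build the level structure of the rooted binary tree for an
    unlabeled graph given as adjacency sets. Each level is a list of
    nodes; each node is a list of vertex ids. At every step the vertex
    v_k is taken from the first non-empty node, removed everywhere, and
    each node is split into a left part (vertices connected to v_k) and
    a right part (vertices not connected to v_k). Empty children are
    dropped here for brevity."""
    levels = [[list(range(n))]]          # root holds all n vertices
    while True:
        current = levels[-1]
        # paper: random vertex from the first non-empty node; here: first entry
        v = next((node[0] for node in current if node), None)
        if v is None:                    # no vertices left to remove
            break
        next_level = []
        for node in current:
            rest = [u for u in node if u != v]
            left = [u for u in rest if u in adj[v]]
            right = [u for u in rest if u not in adj[v]]
            if left:
                next_level.append(left)
            if right:
                next_level.append(right)
        levels.append(next_level)
    return levels
```

For the triangle on vertices {0, 1, 2}, the successive levels are [[0, 1, 2]], [[1, 2]], [[2]]: every remaining vertex is always adjacent to the removed one, so only left children appear.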
II-C. Encoding the Rooted Binary Tree
As we are only interested in coding the structure of G (i.e., vertices up to an automorphism), we do not need to explicitly encode the set of vertices corresponding to the nodes in T. It is sufficient to encode only the cardinalities of the nodes; we call this the tree with cardinalities as node values (see Figure 3). When we refer to encoding a node t_{k,j}, we are referring to encoding the value of that node.
Furthermore, we only need to encode the left nodes, as the values of their siblings (i.e., the right nodes) can be deduced from the values of the parent nodes. Consider a nonempty node t_{k,j}; we can see that

|left(t_{k,j})| + |right(t_{k,j})| = |t_{k,j}| - 1 if j = j*, and |t_{k,j}| otherwise.     (2)

The reason for the discrepancy is our convention of always removing the vertex to encode, v_k, from the first nonempty node at each level.
It is easy to see that once the decoder has reconstructed the tree T, it can reconstruct a graph isomorphic to the original graph G. The procedure is very similar to the one for transforming a graph into a binary tree. We start with n isolated vertices, as determined by the cardinality of the root of T. Then, for each level k, we connect the vertex v_k to the vertices in the left nodes at level k + 1 of T. The resulting graph is isomorphic to the original graph G.
The remaining problem now is encoding the values in the tree T. The better the values can be predicted, the shorter the codelength. The encoder/decoder can predict a value either based on global properties of the graph (e.g., the degree distribution), or based on local properties, which can depend only on the part of the tree already encoded/decoded. One such property is that the value of a left node can only take integer values between 0 and the value available from its parent through (2).
It is important to realize that the properties used for encoding the tree should be properties of the original graph, since they presumably have statistical significance. As an example, if the original graph is modeled as Erdős-Rényi [24], it is easy to see that the node values are binomially distributed: a left node whose parent contributes s candidate vertices (cf. (2)) has value distributed as B(s, p), where p is the edge probability of the graph,

p = m / C(n, 2),     (3)

with m the number of edges and C(n, 2) = n(n - 1)/2 the number of vertex pairs.
The encoder uses the global property p, which must be known by the decoder. In universal coding, this can be transmitted to the decoder initially; Section II-F outlines how to calculate and encode global properties. We call this approach the IID coder, as it is based on the i.i.d. property of the Erdős-Rényi graph. It is essentially equivalent to the approach introduced in [1] for the compression of Erdős-Rényi graphs. The pseudocode for the IID coder is shown in Algorithm 2.
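Under this model, the codelength contributed by a left node is the negative log-probability of its value under the binomial distribution, a bound that an arithmetic coder approaches to within rounding. The following sketch computes that quantity; the function names are ours, not the paper's.

```python
import math

def binom_pmf(k, s, p):
    """Probability that a Binomial(s, p) variable equals k."""
    return math.comb(s, k) * p**k * (1.0 - p) ** (s - k)

def iid_node_codelength_bits(left_value, split_size, p):
    """Ideal codelength, in bits, for a left node whose parent
    contributes split_size candidate vertices (cf. equation (2)):
    the left value is modeled as Binomial(split_size, p)."""
    return -math.log2(binom_pmf(left_value, split_size, p))
```

For example, with p = 0.5 and split_size = 2, a left value of 1 has probability 1/2 and costs exactly 1 bit.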
In practice, Erdős-Rényi graphs are not good models for real-world graphs, as edges are usually not independent [25]. Therefore, we would like to use more advanced graph properties. We classify our coding methods broadly into two classes: 1) the Node-by-Node Coder and 2) the Level-by-Node Coder. In the Node-by-Node Coder, we still use a binomial distribution; however, the edge probability, p, will depend on local properties (i.e., graph motifs) in G. In the Level-by-Node Coder, we use the degree distribution of G, a global property of the graph, to encode the values of all left nodes in a level of T at the same time.

The value of a left node in T can equivalently be seen as a count of edges of the original graph. Since only the number of edges is encoded, any local property used for encoding must be shared by all edges in a left node. Equivalently, any property of the original graph used for encoding must be convertible to a property of the tree T. In other words, we convert the original graph into the tree T, and any properties that we want to use for coding must then become properties purely of T. We will describe this in more detail for each class of coders.
II-D. Class 1: Node-by-Node Coder
For Node-by-Node Coders, we traverse back toward the root to determine the existence of certain motif structures in G. This helps us better encode graphs that are not Erdős-Rényi, since in such graphs edges are not independent of one another.
II-D1. Coding of Triangles
The first Node-by-Node Coder we consider is the triangle coder. The triangle coder results in shorter codelengths for graph classes that induce more triangle structures, such as scale-free graphs. A triangle is a cycle graph with three nodes, which is also a 3-clique. Statistics about triangles are often used to characterize graphs [26].
First, we describe how to deduce the existence of triangles in G from the structure of T. We know that the set of vertices corresponding to a left node at level k + 1 are connected to the vertex v_k. We can also deduce whether there are edges between this set of vertices and an earlier vertex v_l (l < k): we look at the ancestor of the node in level l + 1. If it is a left node, then the vertices corresponding to our left node are also connected to v_l. To have a triangle, there must in addition be an edge between v_k and v_l. This can be verified by checking whether the ancestor of t_{k,j*} at level l + 1 is a left node, since v_k is removed from t_{k,j*}. If that is the case, we can deduce that a triangle forms between v_l, v_k, and any of the vertices corresponding to the left node. The example tree in Figure 2 illustrates this deduction: following the two paths back to the root and finding a common level at which both go left identifies a triangle subgraph in G. We can use the formation of triangles to encode the nodal values in T as follows.
Similar to the IID coder, we use a binomial distribution. However, the triangle coder chooses between two binomial distributions, with parameter p_t or p_0, where p_0 is given by equation (3). We decide between these two distributions with the help of the sets of levels defined in Section II-A. Consider a left node t: the coder will use p_t to encode its value if the set of levels at which path(t) and path(t_{k,j*}) both contain left nodes has at least one element. In that case, the vertices corresponding to the left node and the vertex v_k are both connected to an earlier vertex, and therefore a triangle forms among these vertices in G. When the set is empty, we encode with p_0, as no such triangle exists in G. The pseudocode for the triangle coder is given in Algorithm 3.
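The decision rule can be sketched by representing each root-to-node path as a string of 'L'/'R' moves; this path encoding and the helper names are our own illustration, not the paper's notation.

```python
def left_levels(path):
    """Levels (1-indexed from the root) at which a root-to-node path
    goes left, for a path given as a string of 'L'/'R' moves."""
    return {depth for depth, move in enumerate(path, start=1) if move == 'L'}

def triangle_param(path_node, path_first_nonempty, p_t, p_0):
    """Triangle-coder parameter choice: if both paths contain a left
    move at a common level, the corresponding earlier vertex is a
    common neighbor, a triangle is implied, and p_t is used."""
    shared = left_levels(path_node) & left_levels(path_first_nonempty)
    return p_t if shared else p_0
```

The common-neighbor coder described next generalizes this rule by using the size of the shared set, rather than only its emptiness, to select among a family of parameters.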
II-D2. Coding with the Number of Common Neighbors
In a triangle subgraph, two connected vertices share a single common neighbor. But two vertices may share multiple common neighbors, and there is usually a correlation between the existence of an edge between two vertices and the number of their common neighbors. This property is used for link prediction in complex networks [27, 28]. Therefore, we can generalize the triangle coder to coding with common neighbors. Instead of using p_t or p_0 to parameterize the binomial distribution, we use p_c, where c denotes the number of common neighbors (note that p_0 is the same as in coding with triangles). In other words, we count how many already encoded vertices are connected both to the vertices corresponding to the left node and to the current vertex, and we utilize this count as a statistic of how the graph was built. The number of common neighbors can range from 0 to the maximum degree of a vertex in G, which we denote by d_max.
The way we encode the nodal values of T with this coder is as follows. For a left node t, we look at the cardinality of the set of levels at which path(t) and path(t_{k,j*}) both contain left nodes. It tells us how many already encoded vertices are connected both to the vertices corresponding to t and to the current vertex. This cardinality determines which parameter p_c to pick for the binomial distribution.
II-D3. Coding over 4-Node Motifs
We can explore motifs of larger sizes in G to develop new coders; however, computational cost can be a limiting factor. Here, we extend coding to motifs of 4 vertices. Figure 4 illustrates the motifs that we look for. Note that the priority starts with the 4-clique. If a 4-clique does not exist, then we look for a double-triangle, and finally a 4-cycle. These motifs are encoded with parameters p_{4-clique}, p_{double-triangle}, and p_{4-cycle}, respectively. If none of them exists, then we encode the left node using p_0.
As we mentioned earlier, we find the appropriate parameter to encode the nodal values of T by looking at the structure of the encoded/decoded part of the tree up to that point. Depending on the cardinality of the set of levels at which both paths contain left nodes, different cases are possible:

- The set contains at least two levels. For any pair of levels l and l' in the set with l < l', we check whether the first nonempty node in level l' lies in the left subtree of the tree rooted at each left node in level l. If such a case exists, we use p_{4-clique} to encode the value of the node. Otherwise, configuration 3 of the double-triangle in Figure 4 occurred; thus, we use p_{double-triangle}.

- The set contains exactly one level. Two cases are possible:

  - For the first nonempty node in that level and the node being encoded, we check whether their paths to the root contain a left node at the same level. If such a node exists, configuration 1 of the double-triangle in Figure 4 occurred; thus, we use p_{double-triangle}.

  - For the sibling of that node and the node being encoded, we check whether their paths to the root contain a left node at the same level. If such a node exists, configuration 2 of the double-triangle in Figure 4 occurred; thus, we use p_{double-triangle}.

- The set is empty. First, we find the set that contains all the levels at which path(t) or path(t_{k,j*}) has left nodes. For any pair of levels l and l' in this set with l < l', we look for a case in which the first nonempty node in level l' lies in the left subtree of the tree rooted at any node in level l. If such a case exists, we use p_{4-cycle} to encode the value of the node.

- If none of the above-mentioned cases occurred, we use p_0 to encode the value of the node.
Due to similarity, we skip the pseudocode for this coder to avoid repetition.
II-E. Class 2: Level-by-Node Coder
One important graph statistic is the degree distribution P(d). Class 2 coders utilize the degree distribution, in addition to the graph motifs presented for Class 1 coders, to efficiently encode T.

For a given level k in T, Class 1 coders encode the value of each left node (and deduce the value of the right node) independently of one another. Class 2 coders, on the other hand, utilize the degree distribution to encode the values of all left nodes at the same level altogether.
The number of left nodes encountered in path(t_{k,j*}), which we denote by c_k, is equal to the number of edges in G between v_k and the already encoded vertices. We know that the degree of vertex v_k is therefore lower bounded by c_k. The sum of the values of the left nodes in level k + 1 is d_k - c_k, where d_k is the degree of vertex v_k. We can use this relationship to encode all the left nodes in level k + 1 of T at the same time.
Encoding with Class 2 entails two steps. First, we need to send the degree d_k of v_k. Then, we encode the specific values of the left nodes at level k + 1 conditioned on their sum being d_k - c_k.
For the first step, we use the degree distribution P(d) and the fact that d_k >= c_k. Thus, we encode the degree of v_k with

Pr(d_k | d_k >= c_k) = P(d_k) / sum_{d >= c_k} P(d).     (4)
One should notice that at the time of decoding a given level, the decoder knows c_k based on the tree up to that point, and thus it can calculate (4). Having decoded d_k, the decoder can compute the sum of the left nodes' values at level k + 1 as

s = d_k - c_k.     (5)
Let J denote the number of left nodes in level k + 1, with values X_1, ..., X_J. To encode the left nodes' values in level k + 1, we need to compute the joint probability of observing specific values x_1, ..., x_J conditioned on their sum being equal to (5):

Pr(X_1 = x_1, ..., X_J = x_J | X_1 + ... + X_J = s) = Pr(X_1 = x_1, ..., X_J = x_J) / Pr(X_1 + ... + X_J = s),     (6)

where the numerator is the joint probability of observing x_1, ..., x_J as the values of the left nodes and the denominator is the probability of all configurations with the same sum, s, for the left nodes' values. By the independence assumption on the values of the left nodes, the joint probability in the numerator reduces to the product of the individual probabilities.
Under the independence assumption, two cases are possible:

Identically distributed: This case occurs when we compute (6) using the IID coder. The probability of success is then the same for all left nodes, and the problem reduces to a counting problem. With n_i denoting the number of candidate vertices at the ith left node (known from (2)), we can encode the values of the left nodes using the probability distribution

Pr(x_1, ..., x_J | s) = [ C(n_1, x_1) ... C(n_J, x_J) ] / C(n_1 + ... + n_J, s).     (7)

Not identically distributed: If we use any coder other than the IID coder, then different left nodes in the same level do not share the same parameter. For example, as we showed with the triangle coder from Class 1, one node in Figure 3 would be encoded with p_t while another would be encoded with p_0. To take this into account, we need to encode the values of the left nodes using the probability distribution

Pr(x_1, ..., x_J | s) = [ prod_{i=1}^{J} C(n_i, x_i) p_i^{x_i} (1 - p_i)^{n_i - x_i} ] / Pr(X_1 + ... + X_J = s),     (8)

where p_i is the relevant parameter for each left node in the tree (e.g., p_t or p_0 when using triangles for coding).
The denominator in (8) is a generalized version of the binomial distribution called the Poisson binomial distribution [29]. It is the distribution of a sum of independent Bernoulli trials whose success probabilities are not necessarily identical. Due to the involvement of different probabilities, calculating the denominator in (8) can be cumbersome. To resolve this, methods were proposed in the literature: recursive methods developed in [30, 31] compute the denominator in (8) efficiently. We pay this extra computational cost to have a more efficient coder, since the entropy of the Poisson binomial distribution is bounded above by the entropy of the binomial distribution with the same mean [32]. The pseudocode for encoding with Class 2 is given in Algorithm 4.
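A minimal sketch of such a recursion follows; it is our own implementation of the standard O(n^2) dynamic program, not necessarily the exact algorithm of [30, 31]. Each binomial B(n_i, p_i) is flattened into n_i Bernoulli trials, and the PMF of the running sum is updated one trial at a time.

```python
def poisson_binomial_pmf(probs):
    """PMF of a sum of independent Bernoulli trials with the given
    success probabilities. After processing all trials, pmf[k] is the
    probability that the sum equals k."""
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1.0 - p)   # this trial fails
            new[k + 1] += mass * p       # this trial succeeds
        pmf = new
    return pmf

def sum_prob(node_sizes, node_params, s):
    """Pr(X_1 + ... + X_J = s) for X_i ~ Binomial(n_i, p_i),
    computed by flattening each binomial into n_i Bernoulli trials."""
    trials = [p_i for n_i, p_i in zip(node_sizes, node_params)
              for _ in range(n_i)]
    pmf = poisson_binomial_pmf(trials)
    return pmf[s] if s < len(pmf) else 0.0
```

When all parameters are equal the recursion reduces to an ordinary binomial: for sizes (2, 1) with p = 0.5, the sum is B(3, 0.5) and Pr(sum = 1) = 3/8.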
II-F. Calculation and Encoding of Statistics
We consider encoding in two scenarios: learned coding and universal coding. In learned coding, we are given a set of training graphs of a particular class and have to learn local and global statistics; these statistics are then shared with both the encoder and the decoder. In universal coding, there is no training set and the encoder encodes a single graph; it also has to communicate the statistics to the decoder. Below, we describe the calculation and communication of statistics for each of these scenarios.
II-F1. Learned Coding
For learned coding, we need to learn statistics from a set of training graphs. To do that, each statistic is calculated by averaging the same statistic over the training set. The edge probability used in coding with the IID coder can be estimated from the average degree. Other edge statistics are trickier and should reflect the procedure used for encoding the graph. This means that we cannot simply count the number of triangles in a graph to compute p_t: when we encode a graph with triangles, we look for the formation of a triangle in the specific way forced by the coding algorithm, as described earlier. Thus, we need to transform each graph from the training set into its rooted binary tree representation and gather statistics with the coding algorithm. To estimate p_t and p_0, we traverse each tree the same way we do for coding and divide the tree's nodes into those coded with p_t and those coded with p_0. To estimate each of p_t and p_0, we compute the ratio of the sum of the left nodes' values to the sum of the left and right nodes' values in the corresponding group. The average over all trees gives the estimates of p_t and p_0. The same recipe is used to estimate the edge statistics for common neighbors and 4-node motifs.
The estimation of the degree distribution in Class 2 is straightforward: it can be estimated through a histogram. We compute the degree distribution for each graph in the training set, and the final degree distribution is estimated by averaging over them.
II-F2. Universal Coding
For encoding the average degree in the IID coder, we can send the number of edges, m, in G. The number of bits required to encode the number of edges is about log2(n(n - 1)/2) bits. Once the decoder knows the number of edges, it can compute the parameter of the IID coder, p, as in (3).
For other local statistics, we use sequential estimation, as outlined in [33, Section 13.2]. For example, in coding with triangles, we use sequential estimation of p_t and p_0, specifically the KT estimator [34, 35], which is

p_hat = (a + 1/2) / (a + b + 1),

where a and b are the sums of the left and right nodes' values previously coded in the rooted binary tree, respectively. One should note that the procedure for updating the probabilities p_t and p_0 differs between the two classes. For Class 1 coders, the probabilities are updated after coding each node of the rooted binary tree. For Class 2 coders, the probabilities are updated after coding each level, since all nodes at the same level are encoded together. To estimate the statistics for common neighbors and 4-node motifs, we utilize a similar sequential approach.
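The KT update itself is tiny; the following sketch (function name ours) shows the estimate as the left/right counts accumulate:

```python
def kt_estimate(left_sum, right_sum):
    """Krichevsky-Trofimov estimate of a Bernoulli parameter from the
    accumulated left (success) and right (failure) counts. The add-1/2
    rule yields a proper probability even before any observations."""
    return (left_sum + 0.5) / (left_sum + right_sum + 1.0)
```

Before any data the estimate is 1/2; after observing 3 left and 1 right counts it becomes 3.5/5 = 0.7.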
For the degree distribution, we calculate the degree histogram for the whole graph and use this for coding. The degree of a node is between 0 and n - 1. We can therefore think of the degree histogram as putting each of the n (unlabeled) nodes into one of n buckets, and encoding the histogram can be done by encoding the counts in the buckets. The number of possible configurations is a standard problem in combinatorics: C(2n - 1, n - 1), which can be transmitted with about log2 C(2n - 1, n - 1) bits.
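This stars-and-bars count translates directly into a codelength; a small sketch, assuming the C(2n - 1, n - 1) count stated above:

```python
import math

def degree_histogram_code_bits(n):
    """Bits needed to index one of the C(2n - 1, n - 1) possible ways
    of distributing n unlabeled nodes over n degree buckets (0..n-1)."""
    return math.log2(math.comb(2 * n - 1, n - 1))
```

For n = 2 there are C(3, 1) = 3 possible histograms, so the histogram costs about 1.58 bits; the cost grows roughly linearly in n, which is negligible next to the codelength of the tree itself.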
Hereinafter, in our experiments, we use the above-mentioned approaches to calculate and communicate the local and global statistics of graphs.
II-G. Experiments
We consider two cases in our experiments: 1) using learned coding to encode a graph of the same class as the training set, and 2) using universal coding to encode a single graph. In the former case, we generate synthetic data for different classes of graphs, whereas in the latter case we measure the performance on real-world graphs. We evaluate the performance of our coding methods against the IID coder and also compare the performance of Class 1 and Class 2 against each other.
For learned coding, we first need to learn the statistics for the coders. We considered different classes of graphs in our experiments. In all cases, learning was done on 50 graphs. Then, we used the statistics from training to encode a test graph of the same type. The results are shown in Figure 5. As expected, for the Erdős-Rényi graph, the IID coder is efficient and the other coders do not offer an improvement. However, for Barabási-Albert and Watts-Strogatz graphs, our proposed coders outperform the IID coder by a significant margin. One can observe that the coders in Class 2 have shorter codelengths than their counterparts in Class 1. For the Barabási-Albert graph, all coders in Class 2 have almost the same performance, so choosing one or another does not matter. However, for the Watts-Strogatz graph, coding with common neighbors through Class 2 is the most efficient.
For the compression of real-world graphs, we use universal coding, since there is no training set. Note that the statistics associated with each coder must be communicated to the decoder, as described in Section II-F. We measure the compression performance of the Class 1 and Class 2 coders against each other and also compute the codelength associated with the IID coder and with arithmetic coding of the labeled graph (denoted labeled iid) for comparison. The following undirected graphs, publicly available for download at http://networkrepository.com/ and https://snap.stanford.edu, are considered:
- Genetic interaction: a gene interaction network [36].
- Economic network: the economic network of Victoria, Australia in 1880 [36].
- Transportation network: a simple graph of the Minnesota road network [36].
- Collaboration network: extracted from arXiv papers; covers collaborations among authors in the General Relativity and Quantum Cosmology category [37].
- Facebook-politician: represents blue verified Facebook page networks of politicians [38].
- Internet (AS level): an undirected graph representing AS peering information inferred from the University of Oregon Route Views Project on January 1, 2000 [39].
As we can see from Table I, coders in Class 2 give better performance than their counterparts in Class 1. This improvement comes at the cost of computing (2). Moreover, Class 1 coders achieve shorter codelengths than the IID coder. Another observation is that coding with common neighbors through Class 2 outperforms the other coders in most cases; however, in some cases coding over 4-node motifs and triangles offers an improvement over common neighbors. There is no single coder that is best for all graphs. This is no different from the situation for encoding sequences. If one wants the most efficient coder for a particular graph, one can encode with multiple algorithms and choose the shortest, indicating by a header which coder was used, or use soft combining [40]. In the end, it is a trade-off between compression and complexity.
Table I. Codelengths (in bits) for real-world graphs.

Name | Nodes | Edges | Labeled iid | IID coder | Class 1 Triangle | Class 1 ComNei | Class 1 4-node | Class 2 Degree | Class 2 Triangle | Class 2 ComNei | Class 2 4-node
Genetic interaction | 4227 | 39484 | 365599 | 317379 | 273444 | 235011 | 249467 | 280375 | 235787 | 206485 | 220191
Economic network | 1258 | 7513 | 61224 | 47592 | 49790 | 49597 | 46407 | 45500 | 44159 | 43996 | 40730
Transportation network | 2642 | 3303 | 37937 | 35424 | 14619 | 14620 | 14596 | 12712 | 12530 | 12531 | 12540
Collaboration network | 5242 | 14496 | 164224 | 119038 | 60375 | 52041 | 66203 | 106215 | 53647 | 47320 | 60182
Facebook-politician | 5908 | 41729 | 423451 | 340070 | 222500 | 199514 | 206720 | 327203 | 195739 | 172386 | 179836
Internet (AS level) | 3570 | 7750 | 86206 | 47685 | 48266 | 46665 | 47222 | 31107 | 30469 | 29708 | 30897
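The choose-the-shortest strategy with a coder-identifying header can be sketched as follows; the coder names and codelengths below are illustrative (in the spirit of the Internet (AS level) row), not outputs of the actual coders:

```python
import math

def encode_with_best(codelengths):
    """Encode with every available coder and keep the shortest result.

    codelengths: dict mapping coder name -> codelength in bits for one graph.
    A header of ceil(log2(#coders)) bits tells the decoder which coder won.
    """
    header_bits = math.ceil(math.log2(len(codelengths)))
    name, bits = min(codelengths.items(), key=lambda kv: kv[1])
    return name, bits + header_bits

# illustrative codelengths for three hypothetical coders
best, total = encode_with_best(
    {"iid": 47685, "class1_triangle": 48266, "class2_comnei": 29708})
```

The header cost grows only logarithmically in the number of candidate coders, which is why trying several coders and signaling the winner is cheap relative to the savings.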

III Model Selection in Gaussian Graphical Models
In this section, we introduce an application of graph coding to model selection in Gaussian graphical models. The goal is to find a graph that accurately describes the dependencies among multivariate Gaussian variables [41]. Multivariate Gaussian distributions are widely used to model real-world problems where the relationship among approximately normally distributed variables is of interest. They have been used to model datasets such as stock returns [42, 43], protein-protein interactions [44], and brain functional activities [45].

Let $X$ be a $p$-dimensional random vector with multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. The inverse of the covariance matrix, $K = \Sigma^{-1}$, is known as the precision matrix. If the $(i, j)$ entry of the precision matrix (i.e., $K_{ij}$) is zero, then $X_i$ and $X_j$ are conditionally independent given all the other variables. Therefore, we can visualize the conditional independence relationships between pairs of variables as an unweighted, undirected graph $G$ whose adjacency matrix $A$ is such that

(9)  $A_{ij} = \begin{cases} 1 & K_{ij} \neq 0,\ i \neq j \\ 0 & \text{otherwise.} \end{cases}$

The graph $G$ is known as the conditional independence graph. Its structure is determined by the sparsity pattern of $K$.
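As an illustration, the map from a precision matrix to the adjacency matrix of its conditional independence graph can be sketched as follows (the tolerance and the example matrix are ours, chosen for illustration):

```python
import numpy as np

def conditional_independence_graph(K, tol=1e-8):
    """Adjacency matrix of the conditional independence graph (cf. Eq. (9)):
    A[i, j] = 1 iff K[i, j] != 0 and i != j."""
    A = (np.abs(K) > tol).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A

# X1-X2 and X2-X3 are conditionally dependent; X1-X3 are not
K = np.array([[2.0, 0.6, 0.0],
              [0.6, 2.0, 0.5],
              [0.0, 0.5, 2.0]])
A = conditional_independence_graph(K)
```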
Often, we are interested in estimating $G$ from $n$ i.i.d. observations. To do that, we first need to estimate $K$. The problem is especially difficult when $n < p$, in which case the maximum likelihood estimate does not exist. In most approaches, the estimator of $K$ contains a regularization term to 1) prevent overfitting and 2) control the sparsity of the estimated solution. A popular sparse estimator of the precision matrix is found by maximizing the penalized log-likelihood [41, 16, 46]

(10)  $\hat{K}_\lambda = \arg\max_{K \succ 0} \left( \log\det K - \operatorname{tr}(SK) - \lambda \lVert K \rVert_1 \right),$

where $S$ is the sample covariance matrix and $\lambda \geq 0$ is the regularization parameter that controls sparsity. The performance of the estimator in (10) is highly influenced by the selection of $\lambda$: the resulting conditional independence graph can range from a dense graph for small values of $\lambda$ to a graph with zero edges when $\lambda$ is large. Model selection techniques are used to find the best value of $\lambda$.
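The effect of $\lambda$ in (10) can be observed with an off-the-shelf solver; the sketch below uses scikit-learn's GraphicalLasso (whose alpha parameter plays the role of $\lambda$) on synthetic data. This is an illustration of the sparsity/regularization trade-off, not the paper's experimental setup:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))  # n = 200 samples of a p = 5 Gaussian

# larger alpha penalizes off-diagonal entries harder, so the estimated
# precision matrix -- and hence the graph -- becomes sparser
nonzeros = {}
for lam in (0.02, 0.2, 0.8):
    K_hat = GraphicalLasso(alpha=lam).fit(X).precision_
    off_diag = K_hat[~np.eye(5, dtype=bool)]
    nonzeros[lam] = int(np.sum(np.abs(off_diag) > 1e-8))
```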
We use the MDL principle and select the $\lambda$ that minimizes the sum of the description length of the conditional independence graph structure and the description length of the data when encoded under that graph structure:

(11)  $\lambda^* = \arg\min_{\lambda} \left( L(G_\lambda) + L(\mathbf{x} \mid G_\lambda) \right),$

where $L(G_\lambda)$ is the description length of the conditional independence graph $G_\lambda$, and $L(\mathbf{x} \mid G_\lambda)$ is the description length of the data when encoded with $G_\lambda$.
To compute the description length of the conditional independence graph, $L(G_\lambda)$, we first need to find the underlying conditional independence graph. By applying an estimator to the sequence of $n$ i.i.d. observations from the $p$-variate Gaussian distribution, we obtain an estimated precision matrix $\hat{K}_\lambda$ for each value of $\lambda$. (We drop the explicit dependence on $\lambda$ from $\hat{K}_\lambda$ and the related precision and covariance matrices to simplify notation.) In this paper, we use graphical lasso [41], one of the most commonly used solvers of the penalized log-likelihood in Gaussian graphical models; our approach can easily be applied to any other estimator. The conditional independence graph $G$ is obtained from the estimated precision matrix $\hat{K}$. We then pick a graph coder described in Section II to compute the description length of $G$.
To compute the description length of the data, $L(\mathbf{x} \mid G)$, we face two challenges. The first is that lossless source coding does not generalize directly to real-valued data. The second is that we have to encode the data based on the underlying conditional independence graph $G$. To encode real-valued data, we assume a fixed-point representation with a (large) finite number, $d$, of bits after the period and an unlimited number of bits before the period [3]. The number of bits required to represent the data according to the probability density function (pdf) $f$ is then

(12)  $L(\mathbf{x}) = -\log_2 f(\mathbf{x}) + npd,$

where $np$ is the total number of encoded reals. Since we are only interested in relative codelengths between different models, the dependency on $d$ cancels out.
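A small numeric illustration of why the resolution term cancels when models are compared; the codelengths below are made-up numbers, not measurements:

```python
def codelength_bits(neg_log2_density, num_reals, d):
    """Total bits under a fixed-point representation with d fractional bits:
    -log2 f(x) plus d bits for each real number encoded (cf. Eq. (12))."""
    return neg_log2_density + num_reals * d

# comparing two models on the same data: the d-dependent term is identical
# under both, so the difference is independent of the chosen resolution
diff_32 = codelength_bits(1200.0, 500, 32) - codelength_bits(1450.0, 500, 32)
diff_64 = codelength_bits(1200.0, 500, 64) - codelength_bits(1450.0, 500, 64)
```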
The second challenge is how to encode the data with respect to $G$. As mentioned, the data is generated by a multivariate Gaussian distribution, which can be characterized by its covariance matrix $\Sigma$. Since we do not have access to the true covariance matrix, we find an estimate $\hat{\Sigma}$ based on the sample covariance matrix $S$. The structure of $\hat{\Sigma}$ must match the graph $G$ obtained in the previous step:

(13)  $\hat{\Sigma}_{ij} = S_{ij}$ for $i = j$ or $(i, j) \in G$, and $\big(\hat{\Sigma}^{-1}\big)_{ij} = 0$ for $(i, j) \notin G$, $i \neq j$.

This is known as a matrix completion problem [47]. Dempster [48] proved the existence and uniqueness of the maximum likelihood solution for $\hat{\Sigma}$ when the sample covariance matrix is positive definite, and presented a coordinate-descent algorithm to find the solution iteratively. Note that the precision matrix obtained from the graphical lasso solution is not necessarily equal to the inverse of the estimated covariance matrix $\hat{\Sigma}$.
Once we have estimated $\hat{\Sigma}$, we use predictive MDL [5] to compute

(14)  $L(\mathbf{x} \mid G) = -\sum_{t} \log_2 f\big(x_{t+1} \mid \hat{\theta}(x_1, \ldots, x_t)\big),$

where $f$ is the conditional pdf and $\hat{\theta}(x_1, \ldots, x_t)$ denotes the maximum likelihood estimate of the parameters, which in this case is the estimated covariance obtained under $G$ from the first $t$ samples. The codelength in (14) is a valid codelength since it is sequentially decodable. It does not apply to the first few samples, for which no estimate of $\hat{\theta}$ exists; instead, we encode those samples with a default distribution, which is the same for all values of $\lambda$.
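A sketch of the predictive MDL computation follows. For simplicity it fits an unconstrained Gaussian to the past samples at each step, whereas the paper uses the covariance completed to match $G$; the warmup handling mirrors the default-distribution treatment of the first few samples:

```python
import numpy as np

def predictive_mdl_bits(X, warmup=10):
    """Predictive MDL (cf. Eq. (14)): encode sample t with a Gaussian whose
    parameters are ML estimates from samples 0..t-1. The first `warmup`
    samples, encoded with a default distribution, are skipped here since
    that cost is the same for every candidate model."""
    n, p = X.shape
    bits = 0.0
    for t in range(warmup, n):
        mu = X[:t].mean(axis=0)
        cov = np.cov(X[:t], rowvar=False) + 1e-6 * np.eye(p)  # stabilize
        diff = X[t] - mu
        _, logdet = np.linalg.slogdet(cov)
        quad = diff @ np.linalg.solve(cov, diff)
        nll_nats = 0.5 * (p * np.log(2 * np.pi) + logdet + quad)
        bits += nll_nats / np.log(2)  # nats -> bits
    return bits

rng = np.random.default_rng(1)
L = predictive_mdl_bits(rng.standard_normal((80, 3)))
```

Because each sample is predicted only from earlier samples, the decoder can rebuild the same sequence of estimates, which is what makes the codelength sequentially decodable.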
Finally, the best conditional independence graph structure is obtained by minimizing (11) over $\lambda$. The following algorithm finds the best graph model of the data.

Input: Samples $x_1, \ldots, x_n$ and regularization parameters $\lambda_1, \ldots, \lambda_m$.
Output: Best graph model of the data, $G_{\lambda^*}$.
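The selection loop can be sketched as follows. Here graph_bits and data_bits stand in for a graph coder of Section II and the predictive MDL codelength of (14); the toy stand-ins passed at the bottom exist only to make the sketch self-contained and are not the paper's coders:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def select_best_graph(X, lambdas, graph_bits, data_bits):
    """For each lambda: estimate the conditional independence graph with
    graphical lasso, score it by total description length (Eq. (11)),
    and return the lambda/graph pair with the shortest description."""
    best_lam, best_A, best_len = None, None, np.inf
    for lam in lambdas:
        K = GraphicalLasso(alpha=lam).fit(X).precision_
        A = (np.abs(K) > 1e-8).astype(int)
        np.fill_diagonal(A, 0)
        total = graph_bits(A) + data_bits(X, A)
        if total < best_len:
            best_lam, best_A, best_len = lam, A, total
    return best_lam, best_A

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
lam, A = select_best_graph(
    X, [0.05, 0.2, 0.6],
    graph_bits=lambda A: float(A.sum()),          # toy stand-in for L(G)
    data_bits=lambda X, A: float((1 - A).sum()))  # toy stand-in for L(x | G)
```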
III-A Experiments
We tested our approach on both synthetic and real-world data. For synthetic data, we consider different conditional independence graph structures, and the goal is to recover the true conditional independence graph. For real-world data, we applied our approach to ECG data from a group of healthy subjects and a group of subjects with Kawasaki disease; we are interested in determining whether the conditional independence graphs of the two groups differ.
We now provide simulation results on synthetic data generated from a zero-mean multivariate Gaussian distribution with known precision matrix $K$. We compare the performance of graph-coding methods against other methods in the recovery of the conditional independence graph. We use the F1-score, a widely used metric in the literature: the harmonic mean of precision and recall, where precision is the ratio of the number of correctly estimated edges to the total number of edges in the estimated conditional independence graph, and recall is the ratio of the number of correctly estimated edges to the total number of edges in the true conditional independence graph [23]. We consider two cases: 1) the number of observations is larger than the number of variables, $n > p$, with $(p, n) = (100, 200)$ and $(200, 400)$; and 2) the number of observations is smaller than the number of variables, $n < p$, with $(p, n) = (100, 50)$ and $(200, 100)$. We applied the graphical lasso estimator described in [41] over a grid of $\lambda$ values in each case, chosen to cover a wide range of structures from dense graphs to totally isolated nodes.

Table II. F1-scores for recovery of the conditional independence graph when $n > p$.

Type | p | n | Optimum | CV | BIC | EBIC | IID | Class 1 Triangle | Class 1 ComNei | Class 1 4-node | Class 2 Degree | Class 2 Triangle | Class 2 ComNei | Class 2 4-node
Cycle | 100 | 200 | 1 | 0.26 | 0.64 | 0.75 | 1 | 0.99 | 0.99 | 1 | 1 | 1 | 1 | 1
Cycle | 200 | 400 | 1 | 0.23 | 0.61 | 0.69 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
AR(1) | 100 | 200 | 0.99 | 0.26 | 0.65 | 0.75 | 0.65 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99
AR(1) | 200 | 400 | 1 | 0.23 | 0.62 | 0.70 | 0.99 | 0.99 | 0.99 | 1 | 1 | 1 | 1 | 1
ER | 100 | 200 | 0.71 | 0.28 | 0.63 | 0.40 | 0.68 | 0.69 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70
ER | 200 | 400 | 0.80 | 0.27 | 0.68 | 0.75 | 0.79 | 0.78 | 0.78 | 0.79 | 0.79 | 0.79 | 0.79 | 0.79
Hub | 100 | 200 | 0.50 | 0.23 | 0.48 | 0.01 | 0.47 | 0.48 | 0.48 | 0.48 | 0.49 | 0.49 | 0.49 | 0.49
Hub | 200 | 400 | 0.44 | 0.22 | 0.43 | 0.16 | 0.43 | 0.43 | 0.43 | 0.43 | 0.44 | 0.44 | 0.44 | 0.44
Table III. F1-scores for recovery of the conditional independence graph when $n < p$.

Type | p | n | Optimum | CV | BIC | EBIC | IID | Class 1 Triangle | Class 1 ComNei | Class 1 4-node | Class 2 Degree | Class 2 Triangle | Class 2 ComNei | Class 2 4-node
Cycle | 100 | 50 | 0.89 | 0.28 | 0.26 | 0.74 | 0.81 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89
Cycle | 200 | 100 | 0.97 | 0.23 | 0.29 | 0.69 | 0.76 | 0.96 | 0.96 | 0.96 | 0.97 | 0.96 | 0.96 | 0.96
AR(1) | 100 | 50 | 0.89 | 0.27 | 0.25 | 0.74 | 0.81 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89
AR(1) | 200 | 100 | 0.97 | 0.23 | 0.28 | 0.69 | 0.73 | 0.95 | 0.95 | 0.97 | 0.97 | 0.96 | 0.96 | 0.97
ER | 100 | 50 | 0.54 | 0.28 | 0.26 | 0.40 | 0.47 | 0.49 | 0.49 | 0.49 | 0.36 | 0.46 | 0.46 | 0.45
ER | 200 | 100 | 0.66 | 0.29 | 0.34 | 0.63 | 0.63 | 0.64 | 0.64 | 0.64 | 0.58 | 0.63 | 0.63 | 0.63
Hub | 100 | 50 | 0.32 | 0.18 | 0.17 | 0.02 | 0.25 | 0.15 | 0.15 | 0.13 | 0.22 | 0.25 | 0.24 | 0.24
Hub | 200 | 100 | 0.31 | 0.15 | 0.20 | 0.03 | 0.25 | 0.13 | 0.13 | 0.13 | 0.26 | 0.27 | 0.27 | 0.27
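The F1-scores reported in Tables II and III can be computed from the true and estimated adjacency matrices; a minimal sketch with a made-up example:

```python
import numpy as np

def edge_f1(A_true, A_est):
    """F1-score over edges: precision = correct edges / estimated edges,
    recall = correct edges / true edges, F1 = their harmonic mean."""
    A_true, A_est = np.asarray(A_true), np.asarray(A_est)
    iu = np.triu_indices_from(A_true, k=1)  # edges of an undirected graph
    t = A_true[iu].astype(bool)
    e = A_est[iu].astype(bool)
    tp = int(np.sum(t & e))
    if tp == 0:
        return 0.0
    precision = tp / int(e.sum())
    recall = tp / int(t.sum())
    return 2 * precision * recall / (precision + recall)

A_true = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # true edges: (0,1), (1,2)
A_est  = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # estimated edges: (0,1), (0,2)
score = edge_f1(A_true, A_est)
```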
One of the two cases uses a smaller range of $\lambda$ because the graphical lasso estimator does not converge over the full range in our test cases. The values of $\lambda$ are equally spaced. Multivariate Gaussian data was generated with different precision matrix structures that are frequently used as test cases in the literature [49, 50]:

- Cycle structure.
- Autoregressive process of order one, AR(1).
- Erdős-Rényi (ER) structure, $K = B + \delta I$, where $B$ is a matrix whose off-diagonal elements are nonzero with a fixed probability, taking values randomly chosen from a uniform distribution, and whose diagonal values are set to zero. To keep $K$ positive definite, $\delta$ is chosen using the absolute value of the minimum eigenvalue of $B$; here $I$ is the identity matrix of size $p$.
- Hub structure with two hubs. First, we create the adjacency matrix by setting off-diagonal elements to one with a fixed probability and zero otherwise. Next, we randomly select two hub nodes and set the elements of the corresponding rows and columns to one with a larger fixed probability and zero otherwise. Then, each nonzero element is replaced by a value chosen randomly from a uniform distribution and the matrix is symmetrized. The final precision matrix is obtained as $K = B + \delta I$, with the same setup as the Erdős-Rényi structure.
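A sketch of generating an Erdős-Rényi-style precision matrix of the kind described above; the edge probability and the uniform range are illustrative values, not the paper's:

```python
import numpy as np

def er_precision(p, edge_prob=0.1, low=0.5, high=1.0, seed=0):
    """Erdős-Rényi-style precision matrix K = B + delta * I: off-diagonal
    entries of B are nonzero with probability edge_prob, drawn uniformly
    from [low, high]; delta exceeds |min eigenvalue of B|, so K is
    guaranteed positive definite."""
    rng = np.random.default_rng(seed)
    B = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    mask = rng.random(iu[0].size) < edge_prob
    B[iu] = rng.uniform(low, high, size=iu[0].size) * mask
    B = B + B.T  # symmetric, zero diagonal
    delta = abs(min(np.linalg.eigvalsh(B).min(), 0.0)) + 0.1
    return B + delta * np.eye(p)

K = er_precision(30)
eigmin = float(np.linalg.eigvalsh(K).min())
```

Shifting by $\delta I$ moves every eigenvalue of $B$ up by $\delta$, so the smallest eigenvalue of $K$ is strictly positive by construction.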
Table II and Table III show the results of applying different methods to recover the conditional independence graph, using graphical lasso as the estimator, for the $n > p$ and $n < p$ cases, respectively. To provide a ground truth for comparison, we report the highest possible F1-score on the regularization path in the column named Optimum. For benchmark methods, we considered CV, BIC, and EBIC. The results for CV are given for 5-fold CV, and those for EBIC are given for the recommended value of its parameter [18]. The results for graph-coding methods are obtained by following the steps outlined in Algorithm 5. The values are averages over 50 Monte Carlo trials.
It can be seen that graph-coding techniques (the columns labeled IID, Class 1, and Class 2) outperform the other methods in most cases. Furthermore, the F1-score of the conditional independence graph recovered by these methods is close to the optimum value on the regularization path. Among the graph-coding methods, coding with Class 1 and Class 2 coders gives better performance than coding with the IID coder in most cases, and coders in Class 2 often offer a slight improvement over their counterparts in Class 1. These findings confirm the need for more advanced graph coders that reflect more information about the graph in their codelength.
We also tested our approach on a real-world dataset containing features extracted from 12-lead ECG signals of a group of healthy subjects and a group of subjects with Kawasaki disease, all of age one to three years. The dataset was first processed to select features with empirical distribution close to normal, since normality is the underlying assumption in Gaussian graphical models and a key assumption for encoding data under our approach. After screening all features and keeping those with approximately normal distribution, we applied our approach to find the graph model of the data for each group of subjects over an equally spaced grid of $\lambda$ values. The results, given in Figure 6, show that the graph model of healthy subjects differs from that of subjects with Kawasaki disease. Features whose names end with a lead number vary across different leads; the remaining features are the same across all leads. The conditional independence graph given by our approach for each group of subjects is the same across different graph coders. The conditional independence graph of healthy subjects is sparser, with fewer edges than that of the Kawasaki subjects, and the two graphs share a subset of edges.
IV Conclusion
In this paper, we developed universal coders based on graph statistics for the compression of unlabeled graphs. We achieved this by first transforming the graph structure into a rooted binary tree and showing that compressing the graph structure is equivalent to compressing its associated rooted binary tree. We then used graph statistics to develop two main classes of graph coders: the first uses graph motifs as local properties, and the second utilizes the degree distribution, a global statistic, along with graph motifs. Finally, we introduced an application of graph coding to model selection in Gaussian graphical models.
For future work, we will extend this work in two directions. The first is to extend graph compression to undirected graphs with attributes, where the graph topology will be utilized for the compression of both structure and attributes. The second is to extend graph coding to directed graphs and to use the developed coders for data analysis. As with undirected graphs, the idea is to transform the directed graph into a rooted tree and encode nodal values on that tree.
References
 [1] Yongwook Choi and Wojciech Szpankowski. Compression of graphical structures: Fundamental limits, algorithms, and experiments. IEEE Transactions on Information Theory, 58(2):620–638, 2012.
 [2] C. Steinruecken. Compressing sets and multisets of sequences. IEEE Transactions on Information Theory, 61(3):1485–1490, March 2015.
 [3] Jorma Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of statistics, 11(2):416–431, 1983.
 [4] Jorma Rissanen. Modeling by shortest data description. Automatica, pages 465–471, 1978.
 [5] Jorma Rissanen. Stochastic complexity and modeling. The annals of statistics, pages 1080–1100, 1986.
 [6] Tomasz Luczak, Abram Magner, and Wojciech Szpankowski. Structural information and compression of scale-free graphs. Preprint, 2017.
 [7] T. Luczak, A. Magner, and W. Szpankowski. Asymmetry and structural information in preferential attachment graphs. arXiv preprint arXiv:1607.04102, July 2016.
 [8] Amir R. Asadi, Emmanuel Abbe, and Sergio Verdu. Compressing data on graphs with clusters. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1583–1587, June 2017.
 [9] Payam Delgosha and Venkat Anantharam. Universal lossless compression of graphical data. IEEE Transactions on Information Theory, 66(11):6962–6976, 2020.

 [10] Anders Host-Madsen and June Zhang. Coding of graphs with application to graph anomaly detection. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1829–1833. IEEE, 2018.
 [11] Tomasz Łuczak, Abram Magner, and Wojciech Szpankowski. Compression of preferential attachment graphs. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1697–1701. IEEE, 2019.
 [12] Krzysztof Turowski, Abram Magner, and Wojciech Szpankowski. Compression of dynamic graphs generated by a duplication model. Algorithmica, 82(9):2687–2707, 2020.
 [13] Sourya Basu and Lav R Varshney. Universal and succinct source coding of deep neural networks. arXiv preprint arXiv:1804.02800, 2018.
 [14] Payam Delgosha and Venkat Anantharam. A universal low complexity compression algorithm for sparse marked graphs. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2349–2354. IEEE, 2020.
 [15] Maciej Besta and Torsten Hoefler. Survey and taxonomy of lossless graph compression and spaceefficient graph representations. arXiv preprint arXiv:1806.01799, 2018.
 [16] Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
 [17] Patricia Menéndez, Yiannis AI Kourmpetis, Cajo JF ter Braak, and Fred A van Eeuwijk. Gene regulatory networks from multifactorial perturbations using graphical lasso: application to the dream4 challenge. PloS one, 5(12):e14147, 2010.
 [18] Rina Foygel and Mathias Drton. Extended bayesian information criteria for gaussian graphical models. In Advances in neural information processing systems, pages 604–612, 2010.
 [19] Mojtaba Abolfazli, Anders Host-Madsen, June Zhang, and Andras Bratincsak. Graph coding for model selection and anomaly detection in gaussian graphical models. arXiv preprint arXiv:2102.02431, 2021.
 [20] Adam J Rothman, Peter J Bickel, Elizaveta Levina, Ji Zhu, et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
 [21] Heng Lian. Shrinkage tuning parameter selection in precision matrices estimation. Journal of Statistical Planning and Inference, 141(8):2839–2848, 2011.
 [22] Shaun Lysen. Permuted inclusion criterion: a variable selection technique. Publicly accessible Penn Dissertations, page 28, 2009.
 [23] Han Liu, Kathryn Roeder, and Larry Wasserman. Stability approach to regularization selection (stars) for high dimensional graphical models. In Advances in neural information processing systems, pages 1432–1440, 2010.
 [24] Béla Bollobás. Random Graphs. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2 edition, 2001.
 [25] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
 [26] AlbertLászló Barabási. Network science. Cambridge university press, 2016.
 [27] Lin Yao, Luning Wang, Lv Pan, and Kai Yao. Link prediction based on commonneighbors for dynamic social network. Procedia Computer Science, 83:82–89, 2016.
 [28] Shibao Li, Junwei Huang, Zhigang Zhang, Jianhang Liu, Tingpei Huang, and Haihua Chen. Similaritybased future common neighbors model for link prediction in complex networks. Scientific reports, 8(1):1–11, 2018.
 [29] Yuan H Wang. On the number of successes in independent trials. Statistica Sinica, pages 295–312, 1993.
 [30] XiangHui Chen, Arthur P Dempster, and Jun S Liu. Weighted finite population sampling to maximize entropy. Biometrika, 81(3):457–469, 1994.
 [31] R. E. Barlow and K. D. Heidtmann. Computing koutofn system reliability. IEEE Transactions on Reliability, R33(4):322–323, 1984.

 [32] Peter Harremoës. Binomial and Poisson distributions as maximum entropy distributions. IEEE Transactions on Information Theory, 47(5):2039–2041, 2001.
 [33] T. M. Cover and J. A. Thomas. Elements of Information Theory, 2nd Edition. John Wiley, 2006.
 [34] R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, Mar 1981.
 [35] F. M J Willems, Y.M. Shtarkov, and T.J. Tjalkens. The contexttree weighting method: basic properties. Information Theory, IEEE Transactions on, 41(3):653–664, 1995.
 [36] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
 [37] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD), 1(1):2–es, 2007.
 [38] Benedek Rozemberczki, Ryan Davies, Rik Sarkar, and Charles Sutton. Gemsec: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2019, pages 65–72. ACM, 2019.
 [39] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, 2005.
 [40] P. A J Volf and F. M J Willems. Switching between two universal source coding algorithms. In Data Compression Conference, 1998. DCC ’98. Proceedings, pages 491–500, 1998.
 [41] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
 [42] Stanley J Kon. Models of stock returns - a comparison. The Journal of Finance, 39(1):147–165, 1984.
 [43] Vasyl Golosnoy, Bastian Gribisch, and Roman Liesenfeld. The conditional autoregressive wishart model for multivariate stock market volatility. Journal of Econometrics, 167(1):211–223, 2012.
 [44] Carlo Baldassi, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina, Martin Weigt, and Andrea Pagnani. Fast and accurate multivariate gaussian modeling of protein families: predicting residue contacts and proteininteraction partners. PloS one, 9(3):e92721, 2014.
 [45] Gaël Varoquaux, Alexandre Gramfort, JeanBaptiste Poline, and Bertrand Thirion. Brain covariance selection: better individual functional connectivity models using population prior. Advances in neural information processing systems, 23:2334–2342, 2010.

 [46] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516, 2008.
 [47] Robert Grone, Charles R Johnson, Eduardo M Sá, and Henry Wolkowicz. Positive definite completions of partial hermitian matrices. Linear Algebra and its Applications, 58:109–124, 1984.
 [48] Arthur P Dempster. Covariance selection. Biometrics, pages 157–175, 1972.
 [49] Lu Li and Kim-Chuan Toh. An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3-4):291–315, 2010.
 [50] Kean Ming Tan, Palma London, Karthik Mohan, SuIn Lee, Maryam Fazel, and Daniela Witten. Learning graphical models with hubs. arXiv preprint arXiv:1402.7349, 2014.