1 Introduction
In many data science applications, data appears in the form of large-scale graphs. For example, in social networks, vertices represent users and an edge between vertices represents friendship; in the World Wide Web, vertices are websites and edges indicate hyperlinks from one site to another; in biological systems, vertices can be proteins and edges illustrate protein-to-protein interaction. Such graphs may contain billions of vertices. In addition, edges tend to be correlated with each other since, for example, two people sharing many common friends are likely to be friends as well. How to efficiently compress such large-scale structural information to reduce the I/O and communication costs of storing and transmitting such data is an emerging challenge in the era of big data.
The literature on graph compression is vast, and existing compression schemes follow a variety of methodologies. Several methods exploit combinatorial properties such as cliques and cuts in the graph [1, 2]. Many works target domain-specific graphs such as web graphs [3], biological networks [4, 5], and social network graphs [6]. Various representations of graphs have been proposed, such as the text-based method, where the neighbor list of each vertex is treated as a “word” [7, 8], and the tree method, where the adjacency matrix is recursively partitioned into equal-size submatrices [9]. Succinct graph representations that enable certain types of fast computation, such as adjacency queries or vertex degree queries, have also been widely studied [10]. While most compression schemes are for labeled graphs, there are also works considering lossless compression of unlabeled graphs [11, 12, 13], an evolving sequence of graphs [14, 15], or (correlated) data on the graph [16, 17]. We refer the reader to [18] for an exhaustive survey on lossless graph compression and space-efficient graph representations.
In this paper, we take an information-theoretic approach to lossless compression of a graph. We assume the graph is generated by some random graph model and investigate lossless compression schemes that achieve the theoretical limit, i.e., the entropy of the graph, asymptotically as the number of vertices goes to infinity. When the underlying distribution of the random graph model is known, optimal lossless compression can be achieved by methods such as Huffman coding. However, in most real-world applications, the exact distribution is hard to obtain and the data we are given is a single realization of this distribution. This motivates us to consider the framework of universal compression, in which we assume the underlying distribution belongs to a known family of distributions and require that neither the encoder nor the decoder be a function of the underlying distribution. In this paper, we focus on the family of stochastic block models, which are widely used random graph models that capture the clustering effect in social networks. Our goal is to develop a universal graph compressor for a family of stochastic block models with as wide a range of parameters as possible.
Universal compression of one-dimensional sequences is a well-studied area, especially for the family of independent and identically distributed (i.i.d.) processes and the family of stationary ergodic processes. A large number of universal compressors have been proposed for sequences, such as the Laplace and Krichevsky–Trofimov (KT) compressors for i.i.d. processes [19, 20], Lempel–Ziv [21, 22] and the Burrows–Wheeler transform [23] for stationary ergodic processes, and context tree weighting [24] for finite memory processes. Many of these have been adopted in standard data compression applications such as compress, gzip, GIF, TIFF, and bzip2. A natural question arises: can we convert the two-dimensional adjacency matrix of the graph into a one-dimensional sequence in some order and apply a universal compressor for sequences? For a simple graph model such as the Erdős–Rényi graph, where each edge is generated i.i.d. with probability , this would indeed work. For more complex graph models, including stochastic block models, where edges are correlated with each other, it is unclear whether there is an ordering of the entries that results in a stationary process. We will show in Section 5 that several orders, including row-by-row, column-by-column, and diagonal-by-diagonal, fail to produce a stationary process. On the other hand, in a certain regime of stochastic block models (as shown in Theorem 2), we manage to establish the universality of the Laplace and KT compressors under this approach through a generalization of the analysis from i.i.d. processes to identically distributed but not necessarily independent processes. For experiments, we implement the ordering given by the Peano–Hilbert space-filling curve and apply the Lempel–Ziv compressor [25] on four benchmark graph datasets. Its compression length turns out to be 2.5 to 4 times that of our proposed algorithm.
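To make the space-filling-curve experiment concrete, the following sketch (function names are ours) reads the entries of an adjacency matrix along a Peano–Hilbert curve; the resulting one-dimensional sequence can then be fed to any off-the-shelf Lempel–Ziv implementation (e.g., zlib). The index-to-coordinate mapping below is the standard iterative Hilbert-curve construction and assumes the matrix side length is a power of two.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2^order x 2^order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant so sub-curves connect
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_order(A):
    """Read the entries of a 2^k x 2^k adjacency matrix along the Hilbert curve."""
    n = len(A)
    order = n.bit_length() - 1
    return [A[x][y] for x, y in (hilbert_d2xy(order, d) for d in range(n * n))]
```

Unlike a row-by-row scan, consecutive entries in this ordering are adjacent in the matrix, which is why it tends to interact better with dictionary-based compressors.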
Lossless compression for stochastic block models was first studied by Abbe [16]. The focus there was twofold: 1) computing the entropy of the stochastic block model and 2) exploring the relation between community detection and compression. Several interesting questions were posed: knowing the community labels helps compression, since edges can then be grouped into i.i.d. subsets; but is community detection necessary for compression? In the regime where community detection is not possible, how do we compress the graph? We answer these questions in this paper by presenting a universal compressor that requires no knowledge of the edge probabilities, the community labels, or the number of communities. Our compressor remains universal even in the regime where community detection is not possible. Consequently, universal compression can be an easier task than community detection for stochastic block models.
The rest of the paper is organized as follows. In Section 1.1, we define universality over a family of graph distributions and the stochastic block models. We present our main result in Section 1.2: a graph compressor that is universal for a family containing most nontrivial stochastic block models. We describe the encoding procedure of the proposed graph compressor in Section 2.1. We implement our compressor on four benchmark graph datasets and compare its empirical performance to four competing algorithms in Section 2.2. We illustrate the key steps in establishing universality in Section 3 and elaborate on the proof of each step in Section 4. Section 5 explains why existing universal compressors developed for stationary processes may not be immediately applicable to one-dimensional orderings of the entries of the adjacency matrix.
Notation. For an integer $n$, let $[n] \triangleq \{1, 2, \dots, n\}$. We follow the standard order notation: $a_n = O(b_n)$ if $\limsup_{n\to\infty} |a_n/b_n| < \infty$; $a_n = \Omega(b_n)$ if $b_n = O(a_n)$; $a_n = \Theta(b_n)$ if $a_n = O(b_n)$ and $a_n = \Omega(b_n)$; $a_n = o(b_n)$ if $\lim_{n\to\infty} a_n/b_n = 0$; $a_n = \omega(b_n)$ if $b_n = o(a_n)$; and $a_n \sim b_n$ if $\lim_{n\to\infty} a_n/b_n = 1$.
1.1 Problem Setup
For simplicity, we focus on simple (undirected, unweighted, no self-loop) graphs with labeled vertices in this paper, but our compression scheme and the corresponding analysis can be extended to more general graphs. Let be the set of all labeled simple graphs on vertices. Let be the set of binary sequences of length , and set . A lossless graph compressor is a one-to-one function that maps a graph to a binary sequence. Let denote the length of the output sequence. When is generated from a distribution, it is known that the entropy is a fundamental lower bound on the expected length of any lossless compressor [26, Theorem 8.3]
(1) 
and therefore
Thus, a graph compressor is said to be universal for the family of distributions if for every distribution and , we have
(2) 
A stochastic block model
defines a probability distribution over
. Here is the number of vertices and is the number of communities. Each vertex is associated with a community label . The length- column vector
is a probability distribution over , where indicates the probability that any vertex is assigned community . is an symmetric matrix, where represents the probability of having an edge between a vertex with community label and a vertex with community label . We say if the community labels are generated i.i.d. according to and for every pair , an edge is generated between vertex and vertex with probability . In other words, in the adjacency matrix of the graph, for ; the diagonal entries for all ; and for . We assume all the entries in are in the same regime and write , where is an symmetric matrix with constant entries for all . We assume all entries in are . We will consider two families of stochastic block models: For ,(3)  
(4) 
Note that the edge probability is the threshold for a random graph to contain an edge with high probability [27]. Thus, the family covers most nontrivial SBM graphs. Clearly, is a strict subset of , as it does not contain the constant regime .
1.2 Main Results
Theorem 1 (Universality over ).
For every , the graph compressor defined in Section 2.1 is universal over the family provided that
Theorem 2 (Universality over ).
For every , the graph compressor defined in Section 2.1 is universal over the family .
2 Algorithm
2.1 Universal Graph Compressor
For each that divides , the graph compressor is defined as follows.

Block decomposition. Let . For , let be the submatrix of formed by the rows and the columns . For example, we have
(5) We then write in the blockmatrix form as
(6) Denote
(7) as the sequence of offdiagonal blocks in the upper triangle and
(8) as the sequence of diagonal blocks.
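The block decomposition step can be sketched as follows (a minimal sketch with names of our choosing; the specific orders fixed in (7) and (8) are not reproduced here, since any fixed order of the blocks suffices for illustration):

```python
import numpy as np

def block_decompose(A, k):
    """Split an n x n adjacency matrix into (n/k)^2 blocks of size k x k and
    return the off-diagonal blocks of the upper triangle and the diagonal blocks."""
    n = A.shape[0]
    assert n % k == 0, "k must divide n"
    m = n // k
    # blocks[i, j] is the k x k submatrix in block-row i, block-column j
    blocks = A.reshape(m, k, m, k).swapaxes(1, 2)
    off_diag = [blocks[i, j] for i in range(m) for j in range(i + 1, m)]
    diag = [blocks[i, i] for i in range(m)]
    return off_diag, diag
```

Since the adjacency matrix is symmetric, the lower-triangle blocks carry no extra information and are not emitted.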

Binary to -ary conversion. Let . Each block with binary entries in the two block sequences and is converted into a symbol in .

KT probability assignment. Apply the KT sequential probability assignment to the two -ary sequences and , respectively. Given an -ary sequence , the KT sequential probability assignment defines conditional probability distributions over as follows. For , assign conditional probability
(9) where , and counts the number of occurrences of symbol in .
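For concreteness, the KT assignment can be sketched as follows (function names are ours): each conditional probability is the running count of the symbol plus 1/2, normalized by the prefix length plus half the alphabet size.

```python
def kt_conditional(prefix, symbol, alphabet_size):
    """KT estimate of P(next = symbol | prefix): (count + 1/2) / (t + alphabet_size/2)."""
    return (prefix.count(symbol) + 0.5) / (len(prefix) + alphabet_size / 2)

def kt_sequence_prob(seq, alphabet_size):
    """Product of the sequential KT conditionals, i.e., the implied marginal of seq."""
    p = 1.0
    for t, s in enumerate(seq):
        p *= kt_conditional(seq[:t], s, alphabet_size)
    return p
```

For a binary alphabet, the first symbol gets probability 1/2 and each later conditional is a biased count estimate, so the conditionals always sum to one over the alphabet.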

Adaptive arithmetic coding. With the KT sequential probability assignments, compress the two sequences and separately using adaptive arithmetic coding [28].
Given the compressed graph sequence , the number of vertices and the block size , the graph decompressor is defined as follows.

Adaptive arithmetic decoding. With the KT sequential probability assignments defined in Section 2.1, decompress the two code sequences for and separately using adaptive arithmetic decoding. The lengths of the data sequences and are and , respectively.

-ary to binary conversion. Each -ary symbol in the sequence is converted to a -bit binary number and further converted into a block with binary entries.
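The two conversions (binary block to symbol at the encoder, symbol back to block at the decoder) are base-2 reinterpretations of the block entries. A sketch with hypothetical names, reading the k x k block in row-major order:

```python
def block_to_symbol(block):
    """Pack a k x k binary block (row-major) into a single integer symbol."""
    sym = 0
    for row in block:
        for b in row:
            sym = (sym << 1) | b
    return sym

def symbol_to_block(sym, k):
    """Unpack a symbol in {0, ..., 2^(k*k)-1} into a k x k binary block (row-major)."""
    bits = [(sym >> i) & 1 for i in range(k * k - 1, -1, -1)]
    return [bits[r * k : (r + 1) * k] for r in range(k)]
```

The two functions are inverses of each other, which is part of why the overall compressor is one-to-one.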
Remark 1 ( compressor).
When , the diagonal sequence becomes an all-zero sequence, since we assume the graph is simple. So we will only compress the off-diagonal sequence with the algorithm described above.
Remark 2 (Laplace probability assignment).
As an alternative to the KT sequential probability assignment, one can use the Laplace sequential probability assignment. Given an -ary sequence , the Laplace sequential probability assignment defines conditional probability distributions over as follows. For , we assign conditional probability
(10) 
Both methods can be shown to be universal. The Laplace probability assignment admits a cleaner derivation, while the KT probability assignment yields better empirical performance. For this reason, we keep both in the paper.
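A sketch of the Laplace assignment (function names are ours), together with a numerical check of the standard closed form of its marginal, $\prod_a c_a!\,(A-1)!/(n+A-1)!$ over an alphabet of size $A$, where $c_a$ are the symbol counts:

```python
from math import factorial

def laplace_conditional(prefix, symbol, A):
    """Laplace estimate of P(next = symbol | prefix): (count + 1) / (t + A)."""
    return (prefix.count(symbol) + 1) / (len(prefix) + A)

def laplace_sequence_prob(seq, A):
    """Product of the sequential Laplace conditionals."""
    p = 1.0
    for t, s in enumerate(seq):
        p *= laplace_conditional(seq[:t], s, A)
    return p

def laplace_closed_form(seq, A):
    """Marginal implied by the Laplace conditionals: prod_a c_a! * (A-1)! / (n+A-1)!."""
    num = factorial(A - 1)
    for a in set(seq):
        num *= factorial(seq.count(a))
    return num / factorial(len(seq) + A - 1)
```

The telescoping product of conditionals reproduces the closed form exactly, which is the counts-only (exchangeable) structure used in the length analysis.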
Remark 3 (Relation between probability assignment and compression length).
In Algorithm 1, the terms are added up, which leads to the marginal probability implied by the sequential probability assignment
(11) 
For KT probability assignment, it is known that the marginal probability is
(12) 
For Laplace probability assignment, it is known that the marginal probability is
(13) 
The compression output length of Algorithm 1 is . This relation will be the basis of our length analysis in Propositions 3 and 4.
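Since the output length is, up to the constant overhead of arithmetic coding, the negative log of the assigned marginal probability, it can be evaluated directly from the symbol counts. A sketch (our function name) using the standard Gamma-function form of the KT marginal, which depends on the sequence only through its counts:

```python
from math import lgamma, log

def kt_code_length_bits(counts):
    """-log2 of the KT marginal for a sequence with the given symbol counts,
    i.e., the approximate arithmetic-coding output length in bits."""
    A = len(counts)
    n = sum(counts)
    # log P = sum_a [lgamma(c_a + 1/2) - lgamma(1/2)] + lgamma(A/2) - lgamma(n + A/2)
    logp = sum(lgamma(c + 0.5) - lgamma(0.5) for c in counts)
    logp += lgamma(A / 2) - lgamma(n + A / 2)
    return -logp / log(2)
```

For a skewed binary sequence this comes out close to the empirical entropy plus an O(log n) redundancy term, matching the flavor of the bounds in Propositions 3 and 4.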
Remark 4.
Remark 5.
Why is well-defined? The block decomposition and the binary to -ary conversion are clearly one-to-one. It is also known that for any valid probability assignment, arithmetic coding produces a prefix code, which is also one-to-one.
Remark 6.
The orders in and do not matter for establishing universality. The current orders in (7) and (8), together with arithmetic coding, enable a horizon-free implementation. That is, the encoder does not need to know the horizon to start processing the data and can output partial coded bits on the fly before receiving all the data. This leads to short encoding and decoding delays. In some real-world applications, for example, when the number of users increases in a large social network, this compressor has the advantage of not requiring existing data to be reprocessed and the whole graph to be recompressed from scratch.
2.2 Experiments
We implement the proposed universal graph compressor (UGC) on four widely used benchmark graph datasets: the protein-to-protein interaction network (PPI) [29], the BlogCatalog social network (Blogcatalog) [30], the Flickr user network (Flickr) [30], and the YouTube user network (YouTube) [31]. The block decomposition size is chosen to be and we present in Table 1 the best compression ratio (the ratio between the output length and the input length of the encoder) among all choices of . We compare UGC to four competing algorithms.

CSR: Compressed sparse row is a widely used sparse matrix representation format. In the experiment, we further optimize its default compressor by exploiting the fact that the graph is simple and its adjacency matrix is symmetric with binary entries.

PNG: The adjacency matrix of the graph is treated as a grayscale image and the PNG lossless image compressor is applied.
The compression ratios of the five algorithms on the four datasets are given as follows. The proposed UGC outperforms all competing algorithms on all datasets: the compression ratios of the competing algorithms are 2.4 to 27 times that of the universal graph compressor.
              UGC      CSR     LZ      PNG     Ligra+
PPI           0.0226   0.166   0.06    0.089   0.0605
Blogcatalog   0.0267   0.203   0.080   0.096   0.0682
Flickr        0.00907  0.0584  0.0307  0.0262  0.0217
YouTube
3 Main Ideas in Establishing Universality
In this section, we establish the universality of the graph compressor in Section 2.1. We first calculate the entropy of the (random) graph , which, recall, is the fundamental lower bound on the expected codeword length of any compression scheme. Since to establish optimality we need to show that , we will only be concerned with the first-order term of .
Proposition 1 (Graph entropy).
Let with , and . For , let denote the binary entropy function. For a matrix with entries in , let be a matrix of the same dimension whose entry is . Then
(14)  
(15) 
In particular when and , expression (15) can be further simplified as
(16) 
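As a numerical companion to Proposition 1, the conditional entropy of the graph given the community labels is exactly $\mathbb{E}[H(G \mid X)] = \binom{n}{2}\sum_{u,v} p_u p_v\, h(W_{uv})$ bits, and it differs from $H(G)$ by at most $H(X) = O(n)$ bits. A small sketch (function names are ours):

```python
import numpy as np

def binary_entropy(x):
    """h(x) in bits, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def sbm_conditional_entropy(n, p, W):
    """E[H(G | labels)] = C(n,2) * sum_{u,v} p_u p_v h(W_uv), in bits."""
    p, W = np.asarray(p, float), np.asarray(W, float)
    L = len(p)
    pair = sum(p[u] * p[v] * binary_entropy(W[u, v])
               for u in range(L) for v in range(L))
    return n * (n - 1) / 2 * pair
```

With a single community this reduces to the Erdős–Rényi case, where the entropy is the number of vertex pairs times the binary entropy of the edge probability.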
Remark 7.
In the regime and , the above result has been established in [16]. We extend the analysis to the regime and .
Remark 8.
Proposition 1 can be used to calculate the entropy of the graph in certain important regimes of , in which the SBM displays characteristic behavior. For , we have ; for (the regime where the phase transition for exact recovery of the community labels occurs [34, 35]), we have ; when (the regime where the phase transition for detection between the SBM and the Erdős–Rényi model occurs [36]), we have ; when (the regime where the phase transition for the existence of an edge occurs), we have .

To compress the matrix , we wish to decompose it into a large number of components that have little correlation between them. This leads to the idea of block decomposition described previously. Since the sequence of blocks is used to compress , we now prove that these blocks are identically distributed and asymptotically independent, in a precise sense described as follows.
Proposition 2 (Block decomposition).
Let with for some , , and . Let be an integer that divides and . Consider the block decomposition in Section 2.1. Then all the off-diagonal blocks share the same joint distribution, and all the diagonal blocks share the same joint distribution. In other words, for any with and , we have

In addition, if and , we have
(17) 
Because of this property of the block decomposition, we hope to compress these blocks as if they were independent using a Laplace probability assignment (which, recall, is universal for the class of all -ary i.i.d. processes). However, since these blocks are still correlated (albeit weakly), we need a result on the performance of the Laplace probability assignment on correlated sequences with identical marginals, which we give next.
Proposition 3 (Laplace probability assignment for correlated sequence).
Consider , where each is identically distributed over an alphabet of size , but is not necessarily independent. Let where is the Laplace probability assignment in (13). We then have
(18) 
We provide a similar result for the KT probability assignment.
Proposition 4 (KT probability assignment for correlated sequence).
Consider , where each is identically distributed over an alphabet of size , but is not necessarily independent. Let , where is the KT probability assignment in (12). We then have
(19) 
We are now ready to prove Theorem 1. The proof of Theorem 2 follows similar arguments as in Theorem 1 and is deferred to Section 4.5.
Proof of Theorem 1.
We will prove the universality of under both the KT and the Laplace probability assignments. Note that the upper bound (19) on the expected length under KT is no larger than the upper bound (18) under Laplace, so it suffices to show that the Laplace probability assignment is universal.
We use the bound in Proposition 3 to establish the upper bound on the length of the code. Recall that here we compress the diagonal blocks (-sized alphabet, blocks) and the off-diagonal blocks (-sized alphabet, blocks) separately. We have
where in (a) we bound and , and in (b) we note that there are elements of the matrix (all apart from the diagonal elements) that are distributed identically as . We will now analyze each of these three terms separately. First, Proposition 2 yields . Next, since , we have , and subsequently, substituting , we have
since . Moreover, we have
where the penultimate equality uses the fact that (since ). We have thus established that
which finishes the proof. ∎
4 Proofs
4.1 Graph Entropy
4.2 Asymptotic i.i.d. via Block Decomposition
We first invoke a known property of stochastic block models (see, for example, [37]). We include the proof here for completeness.
Lemma 1 (Exchangeability of SBM).
Let . For a permutation , let be an matrix whose entry is given by . Then, for any permutation , the joint distribution of is the same as the joint distribution of , i.e.,
(24) 
Proof.
Let
be a realization of the random matrix
and be the permuted vector . For any symmetric binary matrix with zero diagonal entries, we have