Universal Graph Compression: Stochastic Block Models

06/04/2020 ∙ by Alankrita Bhatt, et al. ∙ Microsoft ∙ The University of British Columbia ∙ University of California, San Diego

Motivated by the prevalent data science applications of processing and mining large-scale graph data such as social networks, web graphs, and biological networks, as well as the high I/O and communication costs of storing and transmitting such data, this paper investigates lossless compression of data appearing in the form of a labeled graph. A universal graph compression scheme is proposed, which does not depend on the underlying statistics/distribution of the graph model. For graphs generated by a stochastic block model, which is a widely used random graph model capturing the clustering effects in social networks, the proposed scheme achieves the optimal theoretical limit of lossless compression without the need to know edge probabilities, community labels, or the number of communities. The key ideas in establishing universality for stochastic block models include: 1) block decomposition of the adjacency matrix of the graph; 2) generalization of the Krichevsky-Trofimov probability assignment, which was initially designed for i.i.d. random processes. In four benchmark graph datasets (protein-to-protein interaction, LiveJournal friendship, Flickr, and YouTube), the compressed files from competing algorithms (including CSR, Ligra+, PNG image compressor, and Lempel-Ziv compressor for two-dimensional data) take 2.4 to 27 times the space needed by the proposed scheme.




1 Introduction

In many data science applications, data appears in the form of large-scale graphs. For example, in social networks, vertices represent users and an edge between vertices represents friendship; in the World Wide Web, vertices are websites and edges indicate the hyperlinks from one site to another; in biological systems, vertices can be proteins and edges illustrate protein-to-protein interaction. Such graphs may contain billions of vertices. In addition, edges tend to be correlated with each other since, for example, two people sharing many common friends are likely to be friends as well. How to efficiently compress such large-scale structural information to reduce the I/O and communication costs in storing and transmitting such data is an emerging challenge in the era of big data.

The literature on graph compression is vast. Existing compression schemes follow various methodologies. Several methods exploit combinatorial properties such as cliques and cuts in the graph [1, 2]. Many works target domain-specific graphs such as web graphs [3], biology networks [4, 5], and social network graphs [6]. Various representations of graphs have been proposed, such as the text-based method, where the neighbor list of each vertex is treated as a “word” [7, 8], and the $k^2$-tree method, where the adjacency matrix is recursively partitioned into equal-size submatrices [9]. Succinct graph representations that enable certain types of fast computation, such as adjacency query or vertex degree query, have also been widely studied [10]. While most compression schemes are for labeled graphs, there are also works considering lossless compression of unlabeled graphs [11, 12, 13], an evolving sequence of graphs [14, 15], or (correlated) data on the graph [16, 17]. We refer the readers to [18] for an exhaustive survey on lossless graph compression and space-efficient graph representations.

In this paper, we take an information theoretic approach to study lossless compression of a graph. We assume the graph is generated by some random graph model and investigate lossless compression schemes that achieve the theoretical limit, i.e., the entropy of the graph, asymptotically as the number of vertices goes to infinity. When the underlying distribution/statistics of the random graph model is known, optimal lossless compression can be achieved by methods like Huffman coding. However, in most real-world applications, the exact distribution is usually hard to obtain and the data we are given is a single realization of this distribution. This motivates us to consider the framework of universal compression, in which we assume the underlying distribution belongs to a known family of distributions and require that the encoder and the decoder not be a function of the underlying distribution. For this paper, we focus on the family of stochastic block models, which are widely used random graph models that capture the clustering effect in social networks. Our goal is to develop a universal graph compressor for a family of stochastic block models with as wide a range of parameters as possible.

Universal compression for one-dimensional sequences is a well-studied area, especially for the family of independent and identically distributed (i.i.d.) processes and the family of stationary ergodic processes. A large number of universal compressors have been proposed for sequences, such as the Laplace and Krichevsky–Trofimov (KT) compressors for i.i.d. processes [19, 20], Lempel–Ziv [21, 22] and the Burrows–Wheeler transform [23] for stationary ergodic processes, and context tree weighting [24] for finite memory processes. Many of these have been adopted in standard data compression applications such as compress, gzip, GIF, TIFF, and bzip2. A natural question arising here is: Can we convert the two-dimensional adjacency matrix of the graph into a one-dimensional sequence in some order and apply a universal compressor for the sequence? For a simple graph model such as the Erdős–Rényi graph, where each edge is generated i.i.d. with probability $p$, this would indeed work. For more complex graph models including stochastic block models, where edges are correlated with each other, it is unclear whether there is an ordering of the entries that results in a stationary process. We will show in Section 5 that several orders, including row-by-row, column-by-column, and diagonal-by-diagonal, fail to produce a stationary process. On the other hand, in a certain regime of stochastic block models (as shown in Theorem 2), we manage to establish the universality of the Laplace and KT compressors using this approach through a generalization of the analysis from i.i.d. processes to identically distributed but not necessarily independent processes. For experiments, we implement the ordering of the Peano–Hilbert space filling curve and apply the Lempel–Ziv compressor [25] on four benchmark graph datasets. Its compressed length turns out to be 2.5 to 4 times that of our proposed algorithm.

Lossless compression for stochastic block models was first studied by Abbe [16]. The focus there is two-fold: 1) compute the entropy of the stochastic block model; 2) explore the relation between community detection and compression. Several interesting questions were presented: Knowing the community labels will help compression, since edges can be grouped into i.i.d. subsets. But is community detection necessary for compression? In the regime when community detection is not possible, how do we compress the graph? We answer these questions in this paper by presenting a universal compressor that does not require knowledge of the edge probabilities, the community labels, or the number of communities. Our compressor remains universal even in the regime when community detection is not possible. As a consequence, universal compression can be an easier task than community detection for stochastic block models.

The rest of the paper is organized as follows. In Section 1.1, we define universality over a family of graph distributions and the stochastic block models. We present our main result in Section 1.2, which is a graph compressor that is universal for a family containing most non-trivial stochastic block models. We describe the encoding procedure of the proposed graph compressor in Section 2.1. We implement our compressor in four benchmark graph datasets and compare its empirical performance to four competing algorithms in Section 2.2. We illustrate key steps in establishing universality in Section 3 and elaborate the proof of each step in Section 4. Section 5 explains why existing universal compressors developed for stationary processes may not be immediately applicable for some one-dimensional ordering of entries in the adjacency matrix.

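Notation. For an integer $i$, let $[i] = \{1, 2, \ldots, i\}$. Let $\log(\cdot) = \log_2(\cdot)$. We follow the standard order notation: $f(n) = O(g(n))$ if $\limsup_{n\to\infty} |f(n)|/g(n) < \infty$; $f(n) = \Omega(g(n))$ if $\liminf_{n\to\infty} f(n)/g(n) > 0$; $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$; $f(n) = o(g(n))$ if $\lim_{n\to\infty} f(n)/g(n) = 0$; $f(n) = \omega(g(n))$ if $\lim_{n\to\infty} |f(n)|/g(n) = \infty$; and $f(n) \sim g(n)$ if $\lim_{n\to\infty} f(n)/g(n) = 1$.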

1.1 Problem Setup

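For simplicity, we focus on simple (undirected, unweighted, no self-loop) graphs with labeled vertices in this paper, but our compression scheme and the corresponding analysis can be extended to more general graphs. Let $\mathcal{A}_n$ be the set of all labeled simple graphs on $n$ vertices. Let $\{0,1\}^i$ be the set of binary sequences of length $i$, and set $\{0,1\}^* = \bigcup_{i=0}^{\infty} \{0,1\}^i$. A lossless graph compressor $C: \mathcal{A}_n \to \{0,1\}^*$ is a one-to-one function that maps a graph to a binary sequence. Let $\ell(C(A_n))$ denote the length of the output sequence. When $A_n$ is generated from a distribution, it is known that the entropy $H(A_n)$ is a fundamental lower bound on the expected length of any lossless compressor [26, Theorem 8.3]

$$H(A_n) - \log\big(e\,(H(A_n) + 1)\big) \le \mathbb{E}[\ell(C(A_n))],$$

and therefore

$$\liminf_{n\to\infty} \frac{\mathbb{E}[\ell(C(A_n))]}{H(A_n)} \ge 1.$$

Thus, a graph compressor $C$ is said to be universal for the family of distributions $\mathcal{P}$ if for all distributions $P \in \mathcal{P}$ and $A_n \sim P$, we have

$$\limsup_{n\to\infty} \frac{\mathbb{E}[\ell(C(A_n))]}{H(A_n)} \le 1.$$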

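A stochastic block model $\mathrm{SBM}(n, L, \mathbf{p}, \mathbf{W})$ defines a probability distribution over $\mathcal{A}_n$, the set of labeled simple graphs on $n$ vertices. Here $n$ is the number of vertices and $L$ is the number of communities. Each vertex $i \in [n]$ is associated with a community label $X_i \in [L]$. The length-$L$ column vector $\mathbf{p} = (p_1, \ldots, p_L)^T$ is a probability distribution over $[L]$, where $p_i$ indicates the probability that any vertex is assigned community $i$. $\mathbf{W}$ is an $L \times L$ symmetric matrix, where $W_{ij}$ represents the probability of having an edge between a vertex with community label $i$ and a vertex with community label $j$. We say $A_n \sim \mathrm{SBM}(n, L, \mathbf{p}, \mathbf{W})$ if the community labels $X_1, \ldots, X_n$ are generated i.i.d. according to $\mathbf{p}$ and for every pair $1 \le i < j \le n$, an edge is generated between vertex $i$ and vertex $j$ with probability $W_{X_i, X_j}$. In other words, in the adjacency matrix $A_n$ of the graph, $A_{ij} \sim \mathrm{Bern}(W_{X_i, X_j})$ for $i < j$; the diagonal entries $A_{ii} = 0$ for all $i \in [n]$; and $A_{ij} = A_{ji}$ for $i > j$. We assume all the entries in $\mathbf{W}$ are in the same regime $f(n)$ and write $\mathbf{W} = f(n)\mathbf{Q}$, where $\mathbf{Q}$ is an $L \times L$ symmetric matrix with constant entries $Q_{ij} = \Theta(1)$ for all $i, j \in [L]$. We assume all entries in $\mathbf{p}$ are $\Theta(1)$. We will consider two families of stochastic block models: For $0 < \epsilon < 1$,

$$\mathcal{P}_1(\epsilon) = \{\mathrm{SBM}(n, L, \mathbf{p}, f(n)\mathbf{Q}) : f(n) = O(1),\ f(n) = \Omega(1/n^{2-\epsilon})\},$$

$$\mathcal{P}_2(\epsilon) = \{\mathrm{SBM}(n, L, \mathbf{p}, f(n)\mathbf{Q}) : f(n) = o(1),\ f(n) = \Omega(1/n^{2-\epsilon})\}.$$

Note that the edge probability $1/n^2$ is the threshold for a random graph to contain an edge with high probability [27]. Thus, the family $\mathcal{P}_1(\epsilon)$ covers most non-trivial SBM graphs. Clearly, $\mathcal{P}_2(\epsilon)$ is a strict subset of $\mathcal{P}_1(\epsilon)$, as it does not contain the constant regime $f(n) = \Theta(1)$.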
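As a concrete illustration of the generative model described above, the following minimal Python sketch draws community labels and an adjacency matrix. The function name `sample_sbm`, its `seed` argument, and the example parameters are illustrative assumptions rather than part of the paper; labels are numbered from 0 for convenience.

```python
import random

def sample_sbm(n, p, W, seed=0):
    """Sample labels and a symmetric 0/1 adjacency matrix from an SBM with parameters (n, p, W)."""
    rng = random.Random(seed)
    L = len(p)
    # Community labels are drawn i.i.d. according to p.
    labels = rng.choices(range(L), weights=p, k=n)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Given the labels, edge (i, j) is Bernoulli(W[X_i][X_j]), independently of other edges.
            if rng.random() < W[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1  # undirected graph: keep the matrix symmetric
    return labels, A

# Two equally likely communities, dense inside and sparse across (hypothetical values).
labels, A = sample_sbm(n=8, p=[0.5, 0.5], W=[[0.9, 0.1], [0.1, 0.9]])
```

The diagonal is never written, so the sampled graph has no self-loops, matching the simple-graph assumption.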

1.2 Main Results

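Theorem 1 (Universality over $\mathcal{P}_1$).

For every $0 < \epsilon < 1$, the graph compressor $C_k$ defined in Section 2.1 is universal over the family $\mathcal{P}_1(\epsilon)$ provided that

Theorem 2 (Universality over $\mathcal{P}_2$).

For every $0 < \epsilon < 1$, the graph compressor $C_1$ defined in Section 2.1 is universal over the family $\mathcal{P}_2(\epsilon)$.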

2 Algorithm

2.1 Universal Graph Compressor

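For each $k$ that divides $n$, the graph compressor $C_k$ is defined as follows.

  • Block decomposition. Let $n' = n/k$. For $i, j \in [n']$, let $\mathbf{B}_{ij}$ be the $k \times k$ submatrix of $A_n$ formed by the rows $(i-1)k+1, \ldots, ik$ and the columns $(j-1)k+1, \ldots, jk$. For example, we have

    $$\mathbf{B}_{12} = \begin{bmatrix} A_{1,k+1} & \cdots & A_{1,2k} \\ \vdots & \ddots & \vdots \\ A_{k,k+1} & \cdots & A_{k,2k} \end{bmatrix}.$$

    We then write $A_n$ in the block-matrix form as

    $$A_n = \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} & \cdots & \mathbf{B}_{1n'} \\ \mathbf{B}_{21} & \mathbf{B}_{22} & \cdots & \mathbf{B}_{2n'} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{B}_{n'1} & \mathbf{B}_{n'2} & \cdots & \mathbf{B}_{n'n'} \end{bmatrix}. \tag{6}$$

    Denote

    $$\mathbf{B}_{\mathrm{ut}} = (\mathbf{B}_{12}, \mathbf{B}_{13}, \mathbf{B}_{23}, \mathbf{B}_{14}, \mathbf{B}_{24}, \mathbf{B}_{34}, \ldots, \mathbf{B}_{1n'}, \ldots, \mathbf{B}_{n'-1,n'}) \tag{7}$$

    as the sequence of off-diagonal blocks in the upper triangle and

    $$\mathbf{B}_{\mathrm{d}} = (\mathbf{B}_{11}, \mathbf{B}_{22}, \ldots, \mathbf{B}_{n'n'}) \tag{8}$$

    as the sequence of diagonal blocks.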

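  • Binary to $m$-ary conversion. Let $m = 2^{k^2}$. Each $k \times k$ block with $k^2$ binary entries in the two block sequences $\mathbf{B}_{\mathrm{ut}}$ (off-diagonal) and $\mathbf{B}_{\mathrm{d}}$ (diagonal) is converted into a symbol in $[m]$.

  • KT probability assignment. Apply KT sequential probability assignment for the two $m$-ary sequences $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ respectively. Given an $m$-ary sequence $x^N = (x_1, \ldots, x_N) \in [m]^N$, KT sequential probability assignment defines $N$ conditional probability distributions over $[m]$ as follows. For $j = 0, 1, \ldots, N-1$, assign conditional probability

    $$q_{\mathrm{KT}}(X_{j+1} = i \mid X^j = x^j) = \frac{N_i(x^j) + 1/2}{j + m/2}, \quad i \in [m],$$

    where $x^j = (x_1, \ldots, x_j)$, and $N_i(x^j)$ counts the number of occurrences of symbol $i$ in $x^j$.

  • Adaptive arithmetic coding. With the KT sequential probability assignments, compress the two sequences $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ separately using adaptive arithmetic coding [28].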

    Input : Data sequence , alphabet size
    Initialize ;
    for  do
           for  do
                 Compute ;
    Output : the binary representation of with bits
    Algorithm 1 $m$-ary adaptive arithmetic encoding with KT probability assignment
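The first three encoder steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `blocks_to_symbols` and `kt_code_length` are hypothetical, symbols are numbered from 0 rather than 1, the off-diagonal blocks are assumed to be visited column by column, and the arithmetic coder is replaced by its ideal output length $\lceil \log_2(1/q) \rceil + 1$.

```python
import math

def blocks_to_symbols(A, k):
    """Split A into k x k blocks; return (off-diagonal upper-triangle symbols, diagonal symbols)."""
    n = len(A)
    assert n % k == 0, "k must divide n"
    nb = n // k  # number of block rows and block columns

    def symbol(bi, bj):
        # Read the k*k bits of block (bi, bj) row by row as one integer in [0, 2^(k^2)).
        s = 0
        for r in range(bi * k, (bi + 1) * k):
            for c in range(bj * k, (bj + 1) * k):
                s = (s << 1) | A[r][c]
        return s

    # Assumed order: column by column through the upper triangle of blocks.
    off_diag = [symbol(i, j) for j in range(nb) for i in range(j)]
    diag = [symbol(i, i) for i in range(nb)]
    return off_diag, diag

def kt_code_length(seq, m):
    """ceil(log2(1 / q_KT(seq))) + 1, with q_KT accumulated one symbol at a time."""
    counts = {}
    log_q = 0.0
    for j, x in enumerate(seq):
        # KT conditional probability: (count of x so far + 1/2) / (j + m/2).
        log_q += math.log2((counts.get(x, 0) + 0.5) / (j + m / 2))
        counts[x] = counts.get(x, 0) + 1
    return math.ceil(-log_q) + 1

# Path graph 0-1-2-3 with block size k = 2, so the alphabet size is 2^(k^2) = 16.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
off_diag, diag = blocks_to_symbols(A, k=2)
total_bits = kt_code_length(off_diag, 16) + kt_code_length(diag, 16)
```

On such a tiny graph the code length exceeds the six bits needed to list the upper triangle directly; the universality guarantee is asymptotic in the number of vertices.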

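Given the compressed graph sequence $C_k(A_n)$, the number of vertices $n$ and the block size $k$, the graph decompressor is defined as follows.

  • Adaptive arithmetic decoding. With the KT sequential probability assignments defined in Section 2.1, decompress the two code sequences for $\mathbf{B}_{\mathrm{ut}}$ (off-diagonal blocks) and $\mathbf{B}_{\mathrm{d}}$ (diagonal blocks) separately using adaptive arithmetic decoding. The lengths of the data sequences $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$ are $n'(n'-1)/2$ and $n'$ respectively, where $n' = n/k$.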

    Input : Binary sequence , alphabet size , length of data sequence
    Add ‘’ before sequence and convert it into a decimal real number . Initialize ;
    for  do
           for  do
                 Compute ;
          Find minimum such that ;
    Output : the $m$-ary data sequence $x^N$
    Algorithm 2 $m$-ary adaptive arithmetic decoding with KT probability assignment

  • $m$-ary to binary conversion. Each $m$-ary symbol in the sequence is converted to a $k^2$-bit binary number and further converted into a $k \times k$ block with binary entries.

  • Adjacency matrix recovery. With the blocks in $\mathbf{B}_{\mathrm{ut}}$ and $\mathbf{B}_{\mathrm{d}}$, recover the adjacency matrix of $A_n$ in the order described in (6), (7), and (8).
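The encode and decode loops can be illustrated with a small round-trip sketch. It assumes exact rational arithmetic from Python's `fractions` module, which is easy to verify but far too slow for real graphs (practical arithmetic coders use finite-precision arithmetic instead). The function names `kt_probs`, `encode`, and `decode` are hypothetical, symbols are numbered from 0, and the code point is chosen as the first multiple of $2^{-\text{nbits}}$ inside the final interval.

```python
from fractions import Fraction
import math

def kt_probs(counts, j, m):
    # KT conditional probabilities (N_i + 1/2) / (j + m/2), kept exact as fractions.
    return [Fraction(2 * c + 1, 2 * j + m) for c in counts]

def encode(seq, m):
    low, width = Fraction(0), Fraction(1)
    counts = [0] * m
    for j, x in enumerate(seq):
        q = kt_probs(counts, j, m)
        low += width * sum(q[:x])   # skip the sub-intervals of smaller symbols
        width *= q[x]               # after the loop, width equals q_KT(seq)
        counts[x] += 1
    t = 0
    while width * 2 ** t < 1:       # t = ceil(log2(1 / width))
        t += 1
    nbits = t + 1
    point = math.ceil(low * 2 ** nbits)  # first multiple of 2^-nbits at or above low
    return format(point, "0{}b".format(nbits))

def decode(bits, m, length):
    # Read the bit string as the binary fraction 0.b1 b2 b3 ...
    value = Fraction(int(bits, 2), 2 ** len(bits))
    low, width = Fraction(0), Fraction(1)
    counts = [0] * m
    out = []
    for j in range(length):
        q = kt_probs(counts, j, m)
        cum = Fraction(0)
        for x in range(m):
            # The first symbol whose sub-interval upper end exceeds the code point.
            if value < low + width * (cum + q[x]):
                break
            cum += q[x]
        low += width * cum
        width *= q[x]
        counts[x] += 1
        out.append(x)
    return out

seq = [2, 0, 0, 3, 0, 1, 0, 0]
bits = encode(seq, m=4)
roundtrip = decode(bits, m=4, length=len(seq))
```

The decoder needs the sequence length as side information, which mirrors the decompressor above being given $n$ and $k$.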

Remark 1 ($C_1$ compressor).

When $k = 1$, the sequence of diagonal blocks becomes an all-zero sequence, since we assume the graph is simple. So we will only compress the off-diagonal sequence with the algorithm described above.

Remark 2 (Laplace probability assignment).

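As an alternative to the KT sequential probability assignment, one can also use the Laplace sequential probability assignment. Given an $m$-ary sequence $x^N = (x_1, \ldots, x_N) \in [m]^N$, Laplace sequential probability assignment defines $N$ conditional probability distributions over $[m]$ as follows. For $j = 0, 1, \ldots, N-1$, we assign conditional probability

$$q_{\mathrm{L}}(X_{j+1} = i \mid X^j = x^j) = \frac{N_i(x^j) + 1}{j + m}, \quad i \in [m],$$

where $N_i(x^j)$ counts the number of occurrences of symbol $i$ in $x^j = (x_1, \ldots, x_j)$.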

Both methods can be shown to be universal, while Laplace probability assignment has a much cleaner derivation. However, KT probability assignment produces a better empirical performance. For this reason, we keep both in the paper.

Remark 3 (Relation between probability assignment and compression length).

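In Algorithm 1, the per-symbol conditional probabilities are accumulated, which leads to the marginal probability implied by the sequential probability assignment

$$q(x^N) = \prod_{j=0}^{N-1} q(x_{j+1} \mid x^j).$$

For KT probability assignment, it is known that the marginal probability is

$$q_{\mathrm{KT}}(x^N) = \frac{\Gamma(m/2)}{\Gamma(1/2)^m} \cdot \frac{\prod_{i=1}^{m} \Gamma(N_i + 1/2)}{\Gamma(N + m/2)}, \tag{12}$$

where $N_i$ is the number of occurrences of symbol $i$ in $x^N$. For Laplace probability assignment, it is known that the marginal probability is

$$q_{\mathrm{L}}(x^N) = \frac{N_1! \, N_2! \cdots N_m!}{N!} \cdot \frac{1}{\binom{N+m-1}{m-1}}. \tag{13}$$

The compression output length of Algorithm 1 is $\lceil \log \frac{1}{q(x^N)} \rceil + 1$. This relation will be the basis of our length analysis in Propositions 3 and 4.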
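The agreement between the sequential products and the closed-form marginals can be checked numerically. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, symbols are numbered from 0, and the closed forms are the standard Gamma-function and binomial expressions for add-1/2 (KT) and add-1 (Laplace) estimators.

```python
import math
from collections import Counter

def sequential_log2_prob(seq, m, add):
    """log2 of the product of conditionals (N_i + add) / (j + m * add); add=0.5 is KT, add=1 is Laplace."""
    counts = Counter()
    total = 0.0
    for j, x in enumerate(seq):
        total += math.log2((counts[x] + add) / (j + m * add))
        counts[x] += 1
    return total

def kt_marginal_log2(seq, m):
    """log2 of Gamma(m/2) / Gamma(1/2)^m * prod_i Gamma(N_i + 1/2) / Gamma(N + m/2)."""
    counts = Counter(seq)
    ln = math.lgamma(m / 2) - m * math.lgamma(0.5) - math.lgamma(len(seq) + m / 2)
    ln += sum(math.lgamma(counts[i] + 0.5) for i in range(m))
    return ln / math.log(2)

def laplace_marginal_log2(seq, m):
    """log2 of (prod_i N_i! / N!) / binom(N + m - 1, m - 1)."""
    counts = Counter(seq)
    n = len(seq)
    ln = sum(math.lgamma(counts[i] + 1) for i in range(m)) - math.lgamma(n + 1)
    ln -= math.log(math.comb(n + m - 1, m - 1))
    return ln / math.log(2)

seq = [0, 2, 2, 1, 2, 0, 2, 2]
m = 3
kt_gap = abs(sequential_log2_prob(seq, m, 0.5) - kt_marginal_log2(seq, m))
laplace_gap = abs(sequential_log2_prob(seq, m, 1.0) - laplace_marginal_log2(seq, m))
# Output length implied by the KT marginal: ceil(log2(1/q)) + 1 bits.
kt_bits = math.ceil(-sequential_log2_prob(seq, m, 0.5)) + 1
```

Both gaps should vanish up to floating-point error, and both marginals depend on the sequence only through the symbol counts, not through the order of the symbols.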

Remark 4.

The computational complexity of the proposed algorithm is . For the choice of that achieves universality over family in Theorem 1, for . For the choice of that achieves universality over family in Theorem 2, .

Remark 5.

Why is $C_k$ well-defined? The block decomposition and the binary to $m$-ary conversion are clearly one-to-one. It is also known that for any valid probability assignment, arithmetic coding produces a prefix code, which is also one-to-one.

Remark 6.

The orders of the blocks within the off-diagonal and diagonal block sequences do not matter in terms of establishing universality. The current orders in (7) and (8) together with arithmetic coding enable a horizon-free implementation. That is, the encoder does not need to know the horizon to start processing the data and can output partial coded bits on the fly before receiving all the data. This leads to short encoding and decoding delays. For some real-world applications, for example, when the number of users increases in a large social network, this compressor has the advantage of not needing to re-process existing data and re-compress the whole graph from scratch.
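The horizon-free property can be illustrated with a small Python sketch. It assumes, as an interpretation of (7), that the off-diagonal blocks are visited column by column through the upper triangle; the function names are hypothetical and block indices start from 0. Under that assumption, the block sequence for a smaller graph is a prefix of the sequence for a larger one, whereas a row-by-row order does not have this property.

```python
def column_order(nb):
    """Upper-triangle block indices (i, j), i < j, visited column by column."""
    return [(i, j) for j in range(nb) for i in range(j)]

def row_order(nb):
    """Upper-triangle block indices (i, j), i < j, visited row by row (for contrast)."""
    return [(i, j) for i in range(nb) for j in range(i + 1, nb)]

small, large = column_order(4), column_order(5)
# Growing the graph by one block column only appends new blocks at the end.
column_is_prefix = large[:len(small)] == small
# With row-by-row order, new blocks are interleaved with old ones.
row_is_prefix = row_order(5)[:len(row_order(4))] == row_order(4)
```

Because a sequential probability assignment conditions only on the symbols seen so far, a prefix-stable order lets the encoder keep its counts and continue coding when new vertices arrive.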

2.2 Experiments

We implement the proposed universal graph compressor (UGC) on four widely used benchmark graph datasets: protein-to-protein interaction network (PPI) [29], LiveJournal friendship network (Blogcatalog) [30], Flickr user network (Flickr) [30], and YouTube user network (YouTube) [31]. Several values of the block decomposition size $k$ are tried, and we present in Table 1 the best compression ratio (the ratio between the output length and the input length of the encoder) among all choices of $k$. We compare UGC to four competing algorithms.

  • CSR: Compressed sparse row is a widely used sparse matrix representation format. In the experiment, we further optimize its default compressor exploiting the fact that the graph is simple and its adjacency matrix is symmetric with binary entries.

  • LZ: This is an implementation of the algorithm proposed in [25], which first transforms the two-dimensional adjacency matrix into a one-dimensional sequence using the Peano–Hilbert space filling curve and then compresses the sequence using Lempel–Ziv 78 algorithm [22].

  • PNG: The adjacency matrix of the graph is treated as a gray-scaled image and the PNG lossless image compressor is applied.

  • Ligra+: This is another powerful sparse matrix representation format [32, 33], which improves upon CSR using byte codes with run-length coding.

The compression ratios of the five algorithms implemented on four datasets are given as follows. The proposed UGC outperforms all competing algorithms in all datasets. The compression ratios from competing algorithms are 2.4 to 27 times that of the universal graph compressor.

PPI 0.0226 0.166 0.06 0.089 0.0605
Blogcatalog 0.0267 0.203 0.080 0.096 0.0682
Flickr 0.00907 0.0584 0.0307 0.0262 0.0217
Table 1: Comparison of the compression ratios.

3 Main Ideas in Establishing Universality

In this section, we establish the universality of the graph compressor $C_k$ defined in Section 2.1. We first calculate the entropy of the (random) graph $A_n$, which, recall, is the fundamental lower bound on the expected codeword length for any compression scheme. Since establishing optimality requires showing that $\limsup_{n\to\infty} \mathbb{E}[\ell(C_k(A_n))]/H(A_n) \le 1$, we will only be concerned with the first-order term in $H(A_n)$.

Proposition 1 (Graph entropy).

Let with , and . For , let denote the binary entropy function. For a matrix with entries in , let be a matrix of the same dimension whose entry is . Then


In particular when and , expression (15) can be further simplified as

Remark 7.

In the regime and , the above result has been established in [16]. We extend the analysis to the regime and .

Remark 8.

Proposition 1 can be used to calculate the entropy of the graph for certain important regimes of $f(n)$, in which the SBM displays characteristic behavior. For $f(n) = \Theta(1)$, we have $H(A_n) = \Theta(n^2)$; for $f(n) = \Theta(\frac{\log n}{n})$ (the regime where the phase transition for exact recovery of the community labels occurs [34, 35]) we have $H(A_n) = \Theta(n \log^2 n)$; when $f(n) = \Theta(\frac{1}{n})$ (the regime where the phase transition for detection between SBM and the Erdős–Rényi model occurs [36]), we have $H(A_n) = \Theta(n \log n)$; when $f(n) = \Theta(\frac{1}{n^2})$ (the regime where the phase transition for the existence of an edge occurs), we have $H(A_n) = \Theta(\log n)$.
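The scale of the graph entropy can be explored numerically with elementary bounds. The sketch below does not evaluate the expression in Proposition 1; it relies only on two facts that hold for any SBM: conditioning on the labels cannot increase entropy, so $H(A_n) \ge H(A_n \mid X^n)$, and $H(A_n) \le H(A_n, X^n) = H(A_n \mid X^n) + nH(\mathbf{p})$. Given the labels, the edges are independent, so $H(A_n \mid X^n) = \binom{n}{2} \sum_{a,b} p_a p_b\, h(W_{ab})$. The function names and example parameters are hypothetical.

```python
import math

def h2(x):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def sbm_entropy_bounds(n, p, W):
    """Return (lower, upper) with lower = H(A_n | X^n) and upper = lower + n * H(p), in bits."""
    L = len(p)
    # Average entropy of one edge given the two endpoint labels.
    per_edge = sum(p[a] * p[b] * h2(W[a][b]) for a in range(L) for b in range(L))
    lower = math.comb(n, 2) * per_edge
    # Entropy of one community label; n labels are i.i.d.
    label_entropy = -sum(q * math.log2(q) for q in p if q > 0)
    return lower, lower + n * label_entropy

lower, upper = sbm_entropy_bounds(n=1000, p=[0.5, 0.5], W=[[0.02, 0.005], [0.005, 0.02]])
gap_ratio = (upper - lower) / lower  # relative width of the sandwich
```

When the edge term dominates $nH(\mathbf{p})$, as in this example, the two bounds pin down the first-order term; for very sparse regimes the upper bound becomes loose and a finer analysis is needed.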

To compress the matrix $A_n$, we wish to decompose it into a large number of components that have little correlation between them. This leads to the idea of block decomposition described previously. Since the sequence of blocks is used to compress $A_n$, we now prove that these blocks are identically distributed and asymptotically independent in a precise sense described as follows.

Proposition 2 (Block decomposition).

Let with for some , , and . Let be an integer that divides and . Consider the

block decomposition in (REF). We have all the off-diagonal blocks share the same joint distribution; all the diagonal blocks share the same joint distribution. In other words, for any

with and , we have

In addition, if and , we have


Because of this property of the block decomposition, we hope to compress these blocks as if they were independent using a Laplace probability assignment (which, recall, is universal for the class of all $m$-ary i.i.d. processes). However, since these blocks are still correlated (albeit weakly), we will need a result on the performance of the Laplace probability assignment on correlated sequences with identical marginals, which we give next.

Proposition 3 (Laplace probability assignment for correlated sequence).

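Consider $X^N = (X_1, \ldots, X_N)$, where each $X_i$ is identically distributed over an alphabet of size $m$, but the $X_i$ are not necessarily independent. Let $\ell_{\mathrm{L}}(X^N) = \lceil \log \frac{1}{q_{\mathrm{L}}(X^N)} \rceil + 1$, where $q_{\mathrm{L}}$ is the Laplace probability assignment in (13). We then have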


We provide a similar result for the KT probability assignment.

Proposition 4 (KT probability assignment for correlated sequence).

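Consider $X^N = (X_1, \ldots, X_N)$, where each $X_i$ is identically distributed over an alphabet of size $m$, but the $X_i$ are not necessarily independent. Let $\ell_{\mathrm{KT}}(X^N) = \lceil \log \frac{1}{q_{\mathrm{KT}}(X^N)} \rceil + 1$, where $q_{\mathrm{KT}}$ is the KT probability assignment in (12). We then have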


We are now ready to prove Theorem 1. The proof of Theorem 2 follows similar arguments as in Theorem 1 and is deferred to Section 4.5.

Proof of Theorem 1.

We will prove the universality of $C_k$ for both the KT probability assignment and the Laplace probability assignment. Note that the upper bound on the expected length of KT in (19) is upper bounded by the upper bound on the length of Laplace in (18). So it suffices to show that the Laplace probability assignment is universal.

We use the bound in Proposition 3 to establish the upper bound on the length of the code. Recall that here we compress the diagonal blocks (sized alphabet, blocks) and the off-diagonal blocks (sized alphabet, blocks) separately. We have,

where in (a) we bound and , and in (b) we note that since there are elements of the matrix (all apart from the diagonal elements) are distributed identically as . We will now analyze each of these three terms separately. Firstly, using Proposition 2 yields that . Next, since , we have and subsequently substituting , we have

since . Moreover, we have

where the penultimate equality used the fact that (since ). We have then established that

which finishes the proof. ∎

4 Proofs

4.1 Graph Entropy

Proof of Proposition 1.

Note that


where (21) follows since all the edges are identically distributed and also independent given and consequently

When , we see that since

we have that .

Next, consider the case when and . By properties of the entropy, we have


Note that

which yields that . Substituting this in (22) gives


Note now for any , we have

By noting that and as we see that

Using this, we note that and . Finally, substituting this into (23) yields

as required. ∎

4.2 Asymptotic i.i.d. via Block Decomposition

We first invoke a known property of stochastic block models (see, for example, [37]). We include the proof here for completeness.

Lemma 1 (Exchangeability of SBM).

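Let $A_n \sim \mathrm{SBM}(n, L, \mathbf{p}, \mathbf{W})$. For a permutation $\pi: [n] \to [n]$, let $\pi(A_n)$ be an $n \times n$ matrix whose $(i, j)$ entry is given by $A_{\pi(i), \pi(j)}$. Then, for any permutation $\pi$, the joint distribution of $A_n$ is the same as the joint distribution of $\pi(A_n)$, i.e.,

$$A_n \stackrel{d}{=} \pi(A_n).$$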


be a realization of the random matrix

and be the permuted vector . For any symmetric binary matrix with zero diagonal entries, we have