Distributed Compression of Graphical Data

02/21/2018 ∙ by Payam Delgosha, et al.

In contrast to time series, graphical data is data indexed by the nodes and edges of a graph. Modern applications such as the internet, social networks, genomics and proteomics generate graphical data, often at large scale. The large scale argues for the need to compress such data for storage and subsequent processing. Since this data might have several components available in different locations, it is also important to study distributed compression of graphical data. In this paper, we derive a rate region for this problem which is a counterpart of the Slepian-Wolf Theorem. We characterize the rate region when the statistical description of the distributed graphical data is one of two types - a marked sparse Erdos-Renyi ensemble or a marked configuration model. Our results are in terms of a generalization of the notion of entropy introduced by Bordenave and Caputo in the study of local weak limits of sparse graphs.


1 Introduction

Nowadays, storing combinatorially structured data is of great importance in many applications, such as the internet, social networks and biology. For instance, a social network can be represented as a graph where each node models an individual and each edge represents a friendship. Vertices and edges can also carry marks; e.g. the mark of a vertex can represent its type, and the mark of an edge the information shared across it. Due to the sheer amount of such data, its compression has drawn attention, see e.g. [CS12], [Abb16], [DA17]. As the data is not always available in one location, it is also important to consider distributed compression of graphical data.

Traditionally, distributed lossless compression is modeled using two (or more) correlated stationary and ergodic processes representing the components of the data at the individual locations. In this case, the rate region is given by the Slepian–Wolf Theorem [CT12]. We adopt an analogous framework: two correlated marked random graphs on the same vertex set are presented to two encoders, which individually compress their data in such a way that a third party can recover both realizations from the two compressed representations, with a vanishing probability of error in the asymptotic limit of the data size.
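For comparison, the classical Slepian–Wolf rate region for two correlated sources X and Y (in our notation) reads:

```latex
R_1 \ge H(X \mid Y), \qquad
R_2 \ge H(Y \mid X), \qquad
R_1 + R_2 \ge H(X, Y).
```

The result of this paper has an analogous three-inequality shape, with BC entropies in place of Shannon entropies.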

We characterize the compression rate region for two scenarios: a marked sparse Erdős–Rényi ensemble and a marked configuration model. We employ the framework of local weak convergence, also called the objective method, which serves for marked graphs as a counterpart of the notion of stochastic processes [BS01, AS04, AL07]. Our characterization is best understood in terms of a generalization of a measure of entropy introduced by Bordenave and Caputo, which we call the BC entropy [BC14]. It turns out that the BC entropy captures the per-vertex growth rate of the Shannon entropy for the ensembles we study in this paper, which motivates it as a natural measure governing the asymptotic compression bounds.

The paper is organized as follows. In Section 2 we introduce the notation and formally state the problem. Sections 3 and 4 give a brief introduction to the objective method and the BC entropy, mostly specialized for the examples we study. Finally, in Section 5, we characterize the rate region for the scenarios we present in Section 2.

2 Notations and Problem Statement

The set of real numbers is denoted by ℝ. For an integer n, [n] denotes the set {1, …, n}. For a probability distribution P, H(P) denotes its Shannon entropy. Also, for a random variable X, we denote by H(X) its Shannon entropy. For a positive integer n and a sequence of positive integers k_1, …, k_t such that k_1 + ⋯ + k_t ≤ n, we define the multinomial coefficient

(n; k_1, …, k_t) := n! / (k_1! ⋯ k_t! (n − k_1 − ⋯ − k_t)!).

For sequences of reals (a_n) and (b_n), we write a_n = O(b_n) if, for some constant C > 0, we have |a_n| ≤ C |b_n| for n large enough. Furthermore, we write a_n = o(b_n) if a_n / b_n → 0 as n → ∞. We denote by 1[A] the indicator of the event A. For a probability distribution P, X ∼ P denotes that the random variable X has law P. Throughout the paper, logarithms are to the natural base.

A marked graph with edge mark set Ξ and vertex mark set Θ is a graph in which each edge carries a mark in Ξ and each vertex carries a mark in Θ. We assume that all graphs are simple unless otherwise stated. Also, we assume that all edge and vertex mark sets are finite. For two vertices u and v in a graph G, u ∼_G v denotes that u and v are adjacent in G.

Let G be a marked graph on a finite vertex set with edges and vertices carrying marks in the sets Ξ and Θ, respectively. We denote the edge mark count vector of G by m(G) = (m_x(G) : x ∈ Ξ), where m_x(G) is the number of edges in G carrying mark x. Furthermore, we denote the vertex mark count vector of G by u(G) = (u_θ(G) : θ ∈ Θ), where u_θ(G) denotes the number of vertices in G with mark θ. Additionally, for a graph G on the vertex set [n], we denote the degree sequence of G by dg(G) = (dg_1(G), …, dg_n(G)), where dg_v(G) denotes the degree of the vertex v. For a degree sequence d = (d_1, …, d_n) and an integer k, we define

(1)

Also, for two degree sequences d and d′, and two integers k and l, we define

(2)

Given a degree sequence d = (d_1, …, d_n), we let G(d) denote the set of simple unmarked graphs G on the vertex set [n] such that dg_v(G) = d_v for v ∈ [n].

Throughout this paper, we assume that Ξ_1 and Ξ_2 are two fixed and finite sets of edge marks. Moreover, Θ_1 and Θ_2 are two fixed and finite vertex mark sets. For i ∈ {1, 2} and an integer n, let G^(i)_n be the set of marked graphs on the vertex set [n] with edge and vertex mark sets Ξ_i and Θ_i, respectively. For two graphs G_1 ∈ G^(1)_n and G_2 ∈ G^(2)_n, G_1 ⊕ G_2 denotes the superposition of G_1 and G_2, which is a marked graph defined as follows: a vertex v in G_1 ⊕ G_2 carries the mark (θ_1, θ_2), where θ_i is the mark of v in G_i. Furthermore, we place an edge in G_1 ⊕ G_2 between vertices u and v if there is an edge between them in at least one of G_1 and G_2, and mark this edge (x_1, x_2), where, for i ∈ {1, 2}, x_i is the mark of the edge (u, v) in G_i if it exists and ∘_i otherwise. Here ∘_1 and ∘_2 are auxiliary marks not present in Ξ_1 ∪ Ξ_2. Note that G_1 ⊕ G_2 is a marked graph with edge and vertex mark sets (Ξ_1 ∪ {∘_1}) × (Ξ_2 ∪ {∘_2}) and Θ_1 × Θ_2, respectively. We use the terminology jointly marked graph to refer to a marked graph with edge and vertex mark sets (Ξ_1 ∪ {∘_1}) × (Ξ_2 ∪ {∘_2}) and Θ_1 × Θ_2, respectively. With this, let G_n denote the set of jointly marked graphs on the vertex set [n]. Moreover, for i ∈ {1, 2}, we say that a graph is in the i-th domain if its edge and vertex marks come from Ξ_i and Θ_i, respectively. For a jointly marked graph G and i ∈ {1, 2}, the i-th marginal of G, denoted by G_i, is the marked graph in the i-th domain obtained by projecting all vertex and edge marks onto Θ_i and Ξ_i ∪ {∘_i}, respectively, followed by removing the edges with mark ∘_i. Note that any jointly marked graph G is uniquely determined by its marginals G_1 and G_2, because G = G_1 ⊕ G_2. Given an edge mark count vector m for jointly marked graphs, for i ∈ {1, 2}, with an abuse of notation we define

(3)

In a similar fashion, we define the corresponding count for the second domain. Likewise, given a vertex mark count vector u, for i ∈ {1, 2} and a vertex mark in the i-th domain, we define

(4)
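As an illustration, the superposition operation described above can be sketched as follows; the encoding of graphs as mark lists and edge dictionaries, and the placeholder marks "o1" and "o2", are our own conventions.

```python
def superpose(marks1, edges1, marks2, edges2, absent=("o1", "o2")):
    """Superposition of two marked graphs on the same vertex set.

    marks_i: list of vertex marks (index = vertex); edges_i: dict
    {(u, v): mark} with u < v.  A vertex receives the pair of its two
    marks; an edge present in either graph receives the pair of edge
    marks, with absent[i] standing in when graph i+1 lacks the edge."""
    assert len(marks1) == len(marks2), "graphs must share the vertex set"
    vmarks = list(zip(marks1, marks2))
    emarks = {e: (edges1.get(e, absent[0]), edges2.get(e, absent[1]))
              for e in set(edges1) | set(edges2)}
    return vmarks, emarks
```

Projecting the pair marks back onto one coordinate and dropping edges whose mark is the placeholder recovers the corresponding marginal.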

Assume that we have a sequence of random jointly marked graphs, drawn according to some ensemble distribution. Additionally, assume that there are two encoders who want to compress realizations of such jointly marked graphs in a distributed fashion; namely, for i ∈ {1, 2}, the i-th encoder has access only to the i-th marginal. We assume that the encoders know the distribution of the ensemble.

Definition 1.

An code is a tuple of functions for each such that

and

The probability of error for this code corresponding to the ensemble of , which is denoted by , is defined as

Now we define our achievability criterion.

Definition 2.

A rate tuple is said to be achievable for the above scenario if there is a sequence of codes such that

(5)

and also . The rate region is defined as follows: for fixed and , if there are sequences and with limit points and in , respectively, such that for each , the rate tuple is achievable, then we include in the set .

In this paper, we characterize the above rate region for the following two sequences of ensembles:

The Erdős–Rényi ensemble: Assume that nonnegative real numbers q_x, indexed by the joint edge marks x, together with a probability distribution P on the joint vertex marks, are given such that, for all x and θ, we have

(6)

For an integer n large enough, we define the probability distribution on jointly marked graphs on the vertex set [n] as follows: for each pair of vertices u, v ∈ [n], the edge (u, v) is present in the graph and has mark x with probability q_x / n, where q_x is the given nonnegative rate attached to the mark x, and is not present with probability 1 − Σ_x q_x / n. Furthermore, each vertex in the graph is given mark θ with probability P(θ). The choices of edge and vertex marks are made independently.
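A minimal sampler for this ensemble might look as follows; this is a sketch, in which the dictionaries q and vertex_law and the graph representation are illustrative choices, and no effort is made at efficiency for sparse graphs.

```python
import random

def sample_marked_er(n, q, vertex_law):
    """Sample a sparse marked Erdos-Renyi graph on vertices 0..n-1.

    q: dict mapping each edge mark x to a nonnegative rate q_x; the
       edge (u, v) gets mark x with probability q_x / n.
    vertex_law: dict mapping each vertex mark to its probability.
    Returns (list of vertex marks, dict {(u, v): edge mark})."""
    marks = random.choices(list(vertex_law),
                           weights=list(vertex_law.values()), k=n)
    edges = {}
    for u in range(n):
        for v in range(u + 1, n):
            r = random.random() * n  # uniform on [0, n); compare to the rates
            for x, qx in q.items():
                if r < qx:
                    edges[(u, v)] = x  # present with mark x, prob q_x / n
                    break
                r -= qx
            # if r exceeds the total rate, the edge is absent
    return marks, edges
```

With rates q_x / n the expected number of edges grows linearly in n, which is the sparse regime the paper works in.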

The configuration model ensemble: Assume that a fixed integer Δ and a probability distribution P supported on the set {0, …, Δ} are given, such that the mean of P is positive. Moreover, assume that probability distributions on the joint edge and vertex mark sets, respectively, are given. We assume that, for all joint edge marks x and joint vertex marks θ, we have

(7)

Furthermore, let be a sequence of degree sequences such that for all and , we have and also is even. Let . Additionally, if for , denotes the number of such that , we assume that for some constant ,

(8)

Now, we define the law on jointly marked graphs on the vertex set [n] for n large enough as follows. First, we pick an unmarked graph on the vertex set [n] uniformly at random among the set of simple graphs with maximum degree Δ whose degree sequence matches the given one.¹ Then, we assign i.i.d. marks with the given edge mark law on the edges and i.i.d. marks with the given vertex mark law on the vertices. (¹ The fact that each degree is bounded by Δ, and that the sum of the degrees is even, implies that the given sequence is graphic for n large enough. This is, for instance, a consequence of Theorem 4.5 in [BC14].)

As we will discuss in Section 3 below, the sequence of Erdős–Rényi ensembles defined above converges in the local weak sense to a marked Poisson Galton–Watson tree. Moreover, the configuration model ensemble converges in the same sense to a marked Galton–Watson tree with the given degree distribution. We will characterize the achievability rate regions in Section 5 in terms of these limiting objects for the above two sequences of ensembles. This will turn out to be best understood in terms of the measure of entropy discussed in Section 4 below.

3 The framework of Local Weak Convergence

In this section, we discuss the framework of local weak convergence mainly in the context of the Erdős–Rényi and configuration model ensembles discussed in Section 2. For a general discussion, the reader is referred to [BS01, AS04, AL07].

Let Ξ and Θ be finite mark sets. A marked graph G with edge and vertex mark sets Ξ and Θ, respectively, together with a distinguished vertex o, is called a rooted marked graph and is denoted by (G, o). For a rooted marked graph (G, o) and an integer h ≥ 0, (G, o)_h denotes the depth-h neighborhood of o, i.e. the subgraph consisting of the vertices at distance no more than h from o. Note that (G, o)_h is connected by definition. Two connected rooted marked graphs (G, o) and (G′, o′) are said to be isomorphic if there is a vertex bijection between the two graphs that maps o to o′, preserves adjacencies, and preserves vertex and edge marks. With this, we denote the isomorphism class corresponding to a rooted marked graph (G, o) by [G, o]. Let G_*(Ξ, Θ) denote the set of isomorphism classes of connected rooted marked graphs on a countable vertex set with edge and vertex marks coming from the sets Ξ and Θ, respectively; we simply write G_* when the mark sets are clear from the context. It can be shown that G_* can be turned into a separable and complete metric space [AL07]. For a probability distribution μ on G_*, let deg(μ) denote the expected degree at the root in μ.

For a finite marked graph G and a vertex v in G, let G(v) denote the connected component of v in G. With this, if v is a vertex chosen uniformly at random in G, we define U(G) to be the law of the isomorphism class [G(v), v], which is a probability distribution on the space of isomorphism classes of connected rooted marked graphs.
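A crude numerical analogue of U(G), restricted to unmarked graphs and to a coarse invariant of the depth-h neighborhood (the sorted list of distances in the ball, rather than a true isomorphism class, which would require canonical labeling), could look like:

```python
from collections import Counter, deque

def depth_h_ball(adj, root, h):
    """Distances of the vertices within graph distance h of root (BFS)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        if dist[u] == h:
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def empirical_root_distribution(adj, n, h):
    """Empirical law of a coarse depth-h neighborhood statistic at a
    uniformly random root: the sorted multiset of distances in the ball."""
    counts = Counter()
    for root in range(n):
        ball = depth_h_ball(adj, root, h)
        counts[tuple(sorted(ball.values()))] += 1
    return {sig: c / n for sig, c in counts.items()}
```

Local weak convergence says that statistics of this kind, computed at a uniformly random root, stabilize as the graph grows.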

Let G be a random jointly marked graph with the Erdős–Rényi law above, and let v be a vertex chosen uniformly at random in the set [n]. A simple Poisson approximation implies that the number of edges adjacent to v carrying mark x converges in distribution, as n goes to infinity, to a Poisson random variable with mean q_x, the rate attached to the mark x; moreover, these counts are asymptotically mutually independent across marks. A similar argument can be repeated for any other vertex in the neighborhood of v. Also, it can be shown that the probability of a cycle appearing in a fixed-depth neighborhood of v converges to zero. In fact, the depth-h neighborhood of v converges in distribution to a rooted marked Poisson Galton–Watson tree of depth h.

More precisely, let (T, o) be a rooted jointly marked tree defined as follows. First, the mark of the root o is chosen from the vertex mark distribution P. Then, for each edge mark x, we independently generate N_x with law Poisson(q_x) and add N_x edges with mark x to the root. For each offspring, we repeat the same procedure independently, i.e. we choose its mark and its number of edges with each mark from the corresponding Poisson distribution. Recursively repeating this, we get a connected jointly marked tree rooted at o, which possibly has countably infinitely many vertices. Let μ denote the law of the isomorphism class [T, o]. Note that μ is a probability distribution on the space of isomorphism classes of connected rooted marked graphs. The above discussion implies that the empirical neighborhood distribution of the graph sequence converges in distribution to μ. In fact, an even stronger statement can be proved: if we realize the sequence of random graphs independently on a joint probability space, the empirical neighborhood distribution converges weakly to μ with probability one. With this, we say that, almost surely, μ is the local weak limit of the sequence, where the term “local” stands for looking at a fixed-depth neighborhood of a typical node.
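A depth-truncated sampler for this tree might be sketched as follows; the nested-dict representation and the helper Poisson sampler are our own conventions (a library routine such as numpy.random.poisson could replace the helper).

```python
import math
import random

def sample_marked_pgw(q, vertex_law, depth):
    """Sample a depth-limited marked Poisson Galton-Watson tree.

    q: dict mapping each edge mark x to the mean number q_x of
       mark-x edges at every vertex.
    Returns {'mark': vertex mark, 'children': [(edge_mark, subtree), ...]}."""
    def poisson(lam):
        # Knuth's multiplicative method; fine for small rates
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                return k
            k += 1
    node = {"mark": random.choices(list(vertex_law),
                                   weights=list(vertex_law.values()))[0],
            "children": []}
    if depth > 0:
        for x, qx in q.items():
            for _ in range(poisson(qx)):
                node["children"].append(
                    (x, sample_marked_pgw(q, vertex_law, depth - 1)))
    return node
```

Truncating at a finite depth mirrors the fact that local weak convergence only constrains fixed-depth neighborhoods.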

With the above construction, for i ∈ {1, 2}, let (T_i, o) be the i-th marginal of (T, o). Moreover, let μ_i be the law of [T_i, o]. Therefore, μ_i is a probability distribution on the corresponding space of isomorphism classes of rooted marked graphs in the i-th domain. Similarly, one can see that, almost surely, μ_i is the local weak limit of the sequence of i-th marginals.

A similar picture also holds for the configuration model. More precisely, let (T, o) be a rooted jointly marked random tree constructed as follows. First, we generate the degree of the root o with law P. Then, for each offspring v of the root, we independently generate the offspring count of v with law P̂ defined as

P̂(k) = (k + 1) P(k + 1) / E[D],  0 ≤ k ≤ Δ − 1,

where D has law P. We continue this process recursively, i.e. for each vertex other than the root, we independently generate its offspring count with law P̂. The distribution P̂ is called the size-biased distribution; it takes into account the fact that each node other than the root has an extra edge connecting it to its parent, and hence its offspring count must be biased in order to obtain the correct degree distribution P. Then, for each vertex and each edge present in the tree, we generate marks independently with the given vertex and edge mark laws, respectively. Let μ be the law of [T, o]. Moreover, for i ∈ {1, 2}, let μ_i be the law of [T_i, o]. It can be shown that if the graph sequence is drawn from the configuration model ensemble then, almost surely, μ is the local weak limit of the sequence, and μ_i is the local weak limit of the sequence of i-th marginals, for i ∈ {1, 2}.
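The size-biased offspring law admits a one-line implementation; here the degree law is encoded as a dict from degree to probability (our encoding), and the returned law satisfies P_hat(k) ∝ (k + 1) P(k + 1).

```python
def size_biased(P):
    """Offspring law for non-root vertices of the limiting tree:
    P_hat(k) is proportional to (k + 1) * P(k + 1), normalized by
    the mean of the degree law P."""
    mean_deg = sum(k * p for k, p in P.items())
    return {k - 1: k * p / mean_deg for k, p in P.items() if k >= 1}
```

Note that degree-0 vertices of P contribute nothing: a non-root vertex is reached along an edge, so its degree is at least one.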

Given the mark sets, not all probability distributions on the space of isomorphism classes of rooted marked graphs can appear as the local weak limit of a sequence of finite graphs. In fact, the condition that all vertices of a finite graph have the same chance of being chosen as the root manifests itself in the limit as a certain stationarity condition called unimodularity [AL07].

4 The BC entropy

In this section, we discuss a notion of entropy for probability distributions on a space of rooted marked graphs. This is a marked version of the entropy defined by Bordenave and Caputo in [BC14], which was defined for probability distributions on the space of rooted (unmarked) graphs. To distinguish it from the Shannon entropy, we call this notion of entropy the BC entropy.

Let Ξ and Θ be finite mark sets and let μ be a probability distribution on the space of isomorphism classes of connected rooted marked graphs with these mark sets. Moreover, let (m^(n)) and (u^(n)) be sequences of edge and vertex mark count vectors, respectively, such that for all x ∈ Ξ, 2 m^(n)_x / n converges to the expected number of edges with mark x connected to the root in μ, and for all θ ∈ Θ, u^(n)_θ / n converges to the probability of the mark of the root in μ being θ. For ε > 0, let G^(n)(μ, ε) be the set of graphs G on the vertex set [n] with edge and vertex marks in Ξ and Θ, respectively, such that the edge mark count vector of G is m^(n), the vertex mark count vector of G is u^(n), and U(G) is in the ball around μ with radius ε with respect to the Lévy–Prokhorov distance [Bil13].

Definition 3.

If , define

Note that both quantities decrease as ε decreases. Therefore, we may define the upper BC entropy Σ̄(μ) and the lower BC entropy Σ̲(μ) as their respective limits as ε ↓ 0. If Σ̄(μ) = Σ̲(μ), we denote the common value by Σ(μ) and call it the BC entropy of μ.

Using techniques similar to those in the proof of Theorem 1.2 in [BC14], one can show that the upper and lower BC entropies do not depend on the specific choice of the sequences (m^(n)) and (u^(n)), and are well defined for all μ with a positive expected degree at the root.
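For orientation, in the unmarked setting of [BC14] the definition has the following shape (our transcription, with G_{n, m_n}(ρ, ε) the set of graphs on [n] with m_n edges whose neighborhood distribution is ε-close to ρ, and m_n / n → deg(ρ)/2; the marked version replaces the single edge count by the vector of edge mark counts):

```latex
\overline{\Sigma}(\rho) \;=\; \lim_{\epsilon \downarrow 0}\,
  \limsup_{n \to \infty}
  \frac{\log \lvert \mathcal{G}_{n, m_n}(\rho, \epsilon) \rvert - m_n \log n}{n},
\qquad
\underline{\Sigma}(\rho) \;=\; \lim_{\epsilon \downarrow 0}\,
  \liminf_{n \to \infty}
  \frac{\log \lvert \mathcal{G}_{n, m_n}(\rho, \epsilon) \rvert - m_n \log n}{n}.
```

The m_n log n correction accounts for the leading-order count of sparse graphs, leaving a linear-in-n term whose coefficient is the entropy.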

Now, we connect the asymptotic behavior of the entropy of the ensembles defined in Section 2 to the BC entropy of their local weak limits. Assume that has law . Let . Moreover, we use the following notational conventions for and , .

(9)

For , let . If has law , it can be easily verified that with defined as for and zero for , we have

(10a)
(10b)
(10c)

Using a generalization of Theorem 1.3 in [BC14], it can be seen that the coefficients of the linear terms in the above three equations are given by the corresponding BC entropies Σ(μ), Σ(μ_1) and Σ(μ_2), respectively.

Similarly, for the configuration model, let be distributed according to and let be a random variable with law . Moreover, let , , be an i.i.d. sequence distributed according to . With this, let

(11)

Note that for , is basically the distribution of the degree of the root in . If and, for , , it can be seen that (see Appendix A for the details)

(12a)
(12b)
(12c)

Also, it can be seen that the coefficients of the linear terms in the above equations are the corresponding BC entropies Σ(μ), Σ(μ_1) and Σ(μ_2), respectively.

If μ is either of the two distributions defined above, and μ_1 and μ_2 are its marginals, we define the conditional BC entropies as Σ(μ | μ_2) := Σ(μ) − Σ(μ_2) and Σ(μ | μ_1) := Σ(μ) − Σ(μ_1).

5 Main Results

Now, we are ready to state our main result, which characterizes the rate region of Definition 2. In the following, for pairs of reals (a, b) and (c, d), we write (a, b) ⪰ (c, d) if either a > c, or a = c and b ≥ d. We also write (a, b) ≻ (c, d) if either a > c, or a = c and b > d.

Theorem 1.

Assume μ is either of the two distributions defined in Section 4, and let R be the rate region for the corresponding sequence of ensembles defined in Section 2. Then a rate tuple belongs to R if and only if

(13a)
(13b)
(13c)

where , and .

We prove the achievability for the Erdős–Rényi case and the configuration model in Sections 5.1 and 5.2, respectively. Afterwards, we prove the converse for the two cases in Sections 5.3 and 5.4, respectively. Before this, we state the following general lemma used in the proofs, whose proof is straightforward using Stirling’s approximation.

Lemma 1.

Assume that a positive integer and sequences of integers and are given.

  1. If and for each , where , we have

  2. If and , we have

    where is defined to be for and if .
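The Stirling estimates behind Lemma 1 are the classical ones: for a fixed number t of parts k_1, …, k_t summing to n,

```latex
\log n! = n \log n - n + O(\log n), \qquad
\log \binom{n}{k_1, \dots, k_t}
  = n\, H\!\left(\frac{k_1}{n}, \dots, \frac{k_t}{n}\right) + O(\log n),
```

where H(p_1, …, p_t) = −Σ_i p_i log p_i is the Shannon entropy of the probability vector (p_1, …, p_t).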

5.1 Proof of Achievability for the Erdős–Rényi case

Here we show that a rate tuple is achievable for the Erdős–Rényi ensemble if it satisfies the following

(14a)
(14b)
(14c)

Note that if a rate tuple satisfies the weak inequalities (13a)–(13c) then, for any , satisfies the above strict inequalities. As we show below, this implies that is achievable. Hence, after sending , we get .

We show that any rate tuple satisfying (14a)–(14c) is achievable by employing a random binning method. More precisely, for i ∈ {1, 2}, we assign to each possible realization of the i-th marginal a bin index chosen uniformly at random, independently of everything else.
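A toy version of the random binning scheme, with abstract hashable objects standing in for marked graphs and a brute-force decoder over a plausible set (all names are ours), is:

```python
import random

def make_binning(objects, num_bins, seed=0):
    """Random binning encoder: assign every possible realization an
    independent, uniformly random bin index; the transmitted message
    is just the bin index of the observed realization."""
    rng = random.Random(seed)
    return {obj: rng.randrange(num_bins) for obj in objects}

def decode(bin1, bin2, binning1, binning2, plausible_pairs):
    """Decoder: search a 'plausible' (typical) set of pairs for the
    unique pair consistent with both received bin indices; return
    None on ambiguity or failure."""
    matches = [(g1, g2) for g1, g2 in plausible_pairs
               if binning1[g1] == bin1 and binning2[g2] == bin2]
    return matches[0] if len(matches) == 1 else None
```

Decoding fails only when a second plausible pair collides with the true one in both bins, which is made unlikely by choosing the numbers of bins to exceed the sizes dictated by the rate constraints.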

To describe our decoding scheme, we first need to define some notation. Let denote the set of edge count vectors such that

Moreover, let denote the set of vertex mark count vectors such that