Community Detection with Side Information: Exact Recovery under the Stochastic Block Model

05/22/2018 · by Hussein Saad, et al. · The University of Texas at Dallas

The community detection problem involves inferring node labels in a graph from observations of the graph edges. This paper studies the effect of additional, non-graphical side information on the phase transition of exact recovery in the binary stochastic block model (SBM) with n nodes. When side information consists of noisy labels with error probability α, it is shown that the phase transition is improved if and only if log((1−α)/α) = Ω(log(n)). When side information consists of revealing a fraction 1−ϵ of the labels, it is shown that the phase transition is improved if and only if log(1/ϵ) = Ω(log(n)). For more general side information consisting of K features, two scenarios are studied: (1) K is fixed while the likelihood of each feature with respect to the corresponding node label evolves with n, and (2) the number of features K varies with n while the likelihood of each feature is fixed. In each case, we find when side information improves the exact recovery phase transition and by how much. The calculated necessary and sufficient conditions for exact recovery are tight except for one special case. In the process of deriving inner bounds, a variation of an efficient algorithm is proposed for community detection with side information; it combines a partial recovery algorithm with a local improvement procedure.


I Introduction

The problem of learning or detecting community structures in random graphs has been studied in statistics [1, 2, 3, 4, 5], computer science [6, 7, 8, 9, 10] and theoretical statistical physics [11, 12]. Detection of communities on graphs is motivated by applications including finding like-minded people in social networks [13], improving recommendation systems [14], and detecting protein complexes [15]. Among the different random graph models [16, 17], the stochastic block model (SBM) is widely used in the context of community detection [18]. This extension of the Erdös-Renyi model consists of nodes that belong to two communities, with each pair of nodes connected with probability p if the pair belongs to the same community, and with probability q otherwise. The node labels are independent and identically distributed, and their prior is often uniform (labels are equi-probable). The goal of community detection is to recover/detect the labels upon observing the graph edges.
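The generative model just described can be sketched in a few lines of Python. The symbols `p` and `q` for the within- and across-community edge probabilities follow standard SBM notation; the function name and the edge-set representation are conventions of this sketch, not of the paper.

```python
import random

def generate_sbm(n, p, q, seed=0):
    """Sample a binary symmetric SBM: labels are +/-1 with equal
    probability; same-community pairs connect with probability p,
    cross-community pairs with probability q."""
    rng = random.Random(seed)
    labels = [rng.choice([+1, -1]) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if labels[i] == labels[j] else q
            if rng.random() < prob:
                edges.add((i, j))
    return labels, edges

labels, edges = generate_sbm(40, 0.5, 0.05)
```

With p well above q, most sampled edges fall inside the two communities, which is the structure the detection algorithms below exploit.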

Random graphs experience measure concentration in the recovery of labels [18]: for some underlying graph distributions, recovered labels become reliable as the size of the data set increases, while for others they do not. The boundary of this phenomenon is often described as a phase transition [18]. The location of this phase transition, and the set of graphs that fall inside the region it describes, is an important indicator of the broad class of graph-based problems that are reliably solvable in the context of community detection. Much of the theoretical work on community detection [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 18] concentrates on characterizing this phase transition and understanding its properties.

The literature on community detection has, for the most part, concentrated on purely graphical observations. However, in many practical applications, non-graphical relevant information is available that can aid the inference. For example, social networks such as Facebook and Twitter have access to much information other than the graph edges. A citation network has the authors’ names, keywords, and abstracts of papers, and therefore may provide significant additional information beyond the co-authoring relationships. Figure 1 illustrates standard community detection as well as community detection with side information. This paper presents new results on the utility of side information in community detection, in particular shedding light on the conditions under which side information can improve the phase transition of community detection, and the magnitude of the improvement.

Fig. 1: (top) Standard community detection: observed graph → detected communities. (bottom) Community detection with side information: graph + side information → enhanced detection.

Community detection outcomes fall into several broad categories in terms of residual error as the size of the graph grows, enumerated here in increasing order of strength: Correlated recovery refers to community detection that performs better than random guessing [19, 20, 21, 22, 23]. Weak recovery means the fraction of misclassified labels in the graph vanishes with probability converging to one [24, 25, 26]. Exact recovery means correct recovery of all nodes with probability converging to one [27, 28, 18]. This paper concentrates on the exact recovery metric.¹ Formally, let M denote the number of misclassified nodes. Then, correlated recovery means P(M ≤ (1/2 − δ)n) → 1 for some δ > 0, weak recovery means P(M ≤ δn) → 1 for any positive δ, and exact recovery means P(M = 0) → 1.

A few results have recently appeared in the literature on the broader community detection problem in the presence of additional (non-graphical) information. Mossel and Xu [29] studied the behavior of the belief propagation detector in the presence of noisy label information. Cai et al. [30] studied the effect of knowing a growing fraction of labels on correlated and weak recovery. Neither [29] nor [30] includes a converse, so they do not establish a phase transition. Kadavankandy et al. [31] studied the single-community problem with noisy label observations, showing weak recovery in the sparse regime. Kanade et al. [32] showed that partial observation of labels is unhelpful to the correlated recovery phase transition if only a vanishing portion of labels is available. The exact recovery metric is not addressed in these works, and they do not establish a phase transition under side information.² Arguably the closest result in the literature to our work can be found in [33, Theorem 4], which is discussed in Section V-B.

In the interest of completeness, we also mention the following works even though they have a very different perspective. In statistics, several works have appeared on model-matching to real data consisting of both graphical and non-graphical observations, where additional information such as “annotation” [34], “attributes” [35], or “features” [36] has been considered. These works aim at model matching to real (finite) data sets, and propose a parametric model that expresses the joint probability distribution of the graphical and non-graphical (attribute/feature) observations. Although the focus of these papers is very different from the present paper, they nevertheless show the interest of the broader community in modeling side information for graph-based inference.

The following observations further motivate this work. For the exact recovery metric, the effect of side information has not been comprehensively studied. Even for correlated recovery and weak recovery, the effect of side information has only been studied for belief propagation, which is not enough to establish phase transition. In the context of binary labels, only binary side information (possibly with erasures) has been studied. Practical scenarios motivate the study of more general side information whose alphabet does not match the number/identity of communities. Also of interest is side information consisting of several (potentially non-binary) features, which has not been thoroughly investigated either in the context of belief propagation or maximum likelihood, although [33, Theorem 4] opened the subject in a special setting.

II System Model and Contributions

We consider the binary symmetric stochastic block model, with community labels denoted +1 and −1. The number of nodes in the graph is denoted by n. The node labels are independent and identically distributed across nodes, with the +1 and −1 labels having equal probability. If two nodes belong to the same community, there is an edge between them with probability p, and if they are from different communities, there is an edge between them with probability q. Finally, for each node, one or more scalar random variables are observed containing side information. Conditioned on the node labels, the side information of different nodes is assumed to be independent of the other nodes' side information and of the graph edges. Three models for this side information are considered.

In the first model, for each node, scalar side information is observed which equals the true label with probability 1 − α and its complement (a false label) with probability α, where α ∈ (0, 1/2). In the second model, for each node, scalar side information is observed which equals the true label with probability 1 − ϵ or is erased with probability ϵ, where ϵ ∈ (0, 1). In the third model, we consider side information consisting of K random variables (features), each with finite cardinality.
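The first two observation channels can be illustrated with a short sketch; the use of `0` as the erasure symbol and the function names are conventions of this sketch, not the paper's notation.

```python
import random

def noisy_labels(labels, alpha, rng):
    """Model 1: each +/-1 label is flipped independently with probability alpha."""
    return [-x if rng.random() < alpha else x for x in labels]

def erased_labels(labels, eps, rng):
    """Model 2: each label is revealed with probability 1 - eps;
    otherwise 0 is output as an erasure symbol."""
    return [0 if rng.random() < eps else x for x in labels]
```

For example, `noisy_labels(x, 0.1, random.Random(0))` returns a copy of `x` in which roughly ten percent of the entries are sign-flipped.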

The observations consist of the graph and the nodes' side information, which is a vector when each node has scalar side information, or a collection of length-K vectors when the side information for each node consists of K features. The goal is to recover the vector of the nodes' true assignments from the observation of the graph and the side information.

In this paper, exact recovery is considered in the dense regime, i.e., when p = a log(n)/n and q = b log(n)/n with constants a > b > 0. In this regime, the exact recovery phase transition without side information is √a − √b > √2 [27]. We investigate the question: when and by how much can side information affect the phase transition threshold of exact recovery? The contributions of this paper are as follows:

  • When side information consists of observing node labels with erasure probability ϵ, we show that if log(1/ϵ) = o(log(n)), the phase transition is not improved by side information. On the other hand, if ϵ = n^(−β) for some β > 0, i.e., log(1/ϵ) = Ω(log(n)), a necessary and sufficient condition for exact recovery is derived.

  • When side information consists of observing node labels with error probability α, if log((1−α)/α) is o(log(n)), then the phase transition is not improved by side information. On the other hand, if α = n^(−β) for some β > 0, i.e., log((1−α)/α) = Ω(log(n)), necessary and sufficient conditions for exact recovery are derived as follows:

    with the following parameters defined for convenience:

    (1)
    (2)

    An early version of this result appeared in [37].

  • When side information consists of K features, each with finite and fixed cardinality, two scenarios are considered: (1) K is fixed while the conditional distribution of each feature varies with n. In this scenario, we study how the quality of each feature must evolve as the size of the graph grows so that the phase transition can be improved. (2) K varies with n while the conditional distribution of the features is fixed. In this scenario, the quality of the features is independent of n, and we study how many features are needed, in addition to the graphical information, so that the phase transition can be improved.

  • Sufficient conditions are provided via an efficient algorithm employing partial recovery and a local improvement using both the graph and the side information. The two-step recovery algorithm without side information appeared in [27, 18, 38]. In this paper, it is refined and generalized in the presence of side information.

Remark 1

In earlier community detection problems [27, 18], the LLRs do not depend on n even though the individual likelihoods (obviously) do. This was very fortunate for calculating asymptotics. In the presence of side information, this convenience disappears and the LLRs now depend on n, creating complications in bounding error event probabilities en route to finding the threshold in the asymptote of large n. Overcoming this technical difficulty is part of the contributions of this paper.

To illustrate the results of this paper, Figures 2 and 3 show the error exponent, for side information consisting of partially revealed labels or noisy label observations, as a function of β. It is observed that the critical value needed for recovery depends on β. For partially revealed labels, the critical value is obtained directly. For noisy label observations, the critical value can be determined as follows: in one regime of β, it is the solution of an equation; in the other, the critical value is one.

Fig. 2: Error exponent for noisy label observations as a function of β.
Fig. 3: Error exponent for partial label observations as a function of β.

III Noisy Label Side Information

In this section, side information consists of a noisy version of each label that fails to match the true label with probability α.

We begin by calculating the maximum likelihood rule for detecting the communities under side information. The maximum likelihood detector without side information [27] is the minimizer of the number of edges between the two detected communities, subject to both detected communities having size n/2. The sets of nodes belonging to the two communities, the number of edges whose two vertices belong to the first community, the number of edges whose two vertices belong to the second community, and the total number of edges in the graph are denoted accordingly. Also, define:

Then, the log-likelihood function can be written as:

(3)

where (a) holds because the side-information observations are independent given the labels. In (b), all terms that are independent of the community assignment have been collected into a constant, and the edge log-likelihood ratio has been approximated, which is made possible because both p and q approach zero as n grows. The difference between Eq. (3) and the likelihood function without side information is the side-information term and a constant that is hidden inside the collected constant.
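For intuition, the structure of this likelihood — a graph term that rewards within-community edges plus an additive side-information term — can be checked by brute force on a toy instance. The weights below follow the standard SBM log-likelihood ratios; the function name, the balanced-partition enumeration, and the toy parameters are illustrative assumptions of this sketch.

```python
import math
from itertools import combinations

def ml_with_side_info(n, edges, y, p, q, alpha):
    """Brute-force maximum likelihood over balanced +/-1 assignments.
    Score = (edge weight) * (# observed within-community edges)
          + (side-info weight) * (# nodes whose noisy label matches).
    For balanced partitions the non-edge terms are a common constant,
    so only within-community present edges matter."""
    w_edge = math.log((p * (1 - q)) / (q * (1 - p)))   # per within-community edge
    w_side = math.log((1 - alpha) / alpha)             # per agreeing noisy label
    best, best_score = None, -math.inf
    for plus in combinations(range(n), n // 2):
        x = [1 if i in plus else -1 for i in range(n)]
        within = sum(1 for (i, j) in edges if x[i] == x[j])
        agree = sum(1 for i in range(n) if y[i] == x[i])
        score = w_edge * within + w_side * agree
        if score > best_score:
            best_score, best = score, x
    return best
```

Note that without the side-information term the two labelings related by a global sign flip score identically; the noisy labels break that ambiguity.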

The following lemma characterizes a lower bound on the probability of failure of the maximum likelihood detector. Let E(·, ·) denote the number of edges between two sets of nodes.³ For economy of notation, in the arguments of E(·, ·) we represent singleton sets by their single member.

Lemma 1

Let and denote the true communities. Define the following events:

(4)

Then, .

Define two new communities by swapping a pair of nodes between the true communities. If the swapped assignment has a likelihood at least as large as the true one, maximum likelihood chooses incorrectly and therefore fails. We show that this happens under the events defined above.

Let a random variable represent the existence of the edge between two given nodes. Then, using (3):

(5)

where the first step holds by the assumption that the event in question occurred, and the second holds by the definitions above. The inequality implies the failure of maximum likelihood.

III-A Necessary Conditions

Theorem 1

Define . The maximum likelihood failure probability is bounded away from zero if:

Since the true assignment is generated uniformly, the ML detector is optimal in error probability. Hence, if ML fails with nonzero probability, every other detector must also fail with nonzero probability, so it suffices to establish the error probability of ML. The main difficulty in bounding the error probability of ML is the dependency between the graph edges. To overcome this dependency, we follow steps that are broadly similar to [27], but our bounding techniques involve Chernoff-type arguments and the Cramér and Sanov large deviation principles, which are more compact than the combinatorial techniques of [27].

Definition 1

Let be a subset of with and define the following events for each node :

and the following events defined on :

Lemma 2

If and for , then there exists a positive so that .

Clearly . Hence,

By the symmetry of the graph and the side information, as well. Also, by Lemma 1 . Then:

For , is bounded away from zero.

Lemma 3

Let . Then:

The bound follows via a multiplicative form of the Chernoff bound applied to a sum of i.i.d. random variables. Thus, by the union bound:

Lemma 4

For any and for sufficiently large , if , then .

Because the underlying random variables are i.i.d.:

(6)

where the last inequality holds by the statement of the Lemma. If is , then the quantity inside the bracket tends to and the result follows. If is not , then from Eq. (6) it follows that and again the result of the Lemma holds.

The following lemma completes the proof of Theorem 1.

Lemma 5

For sufficiently large , for , if one of the following is satisfied:

See Appendix A. Combining Lemmas 2, 3, 4, and 5 concludes the proof of the theorem.

III-B Sufficient Conditions

Sufficient conditions are derived via a two-step algorithm whose first step uses a component from [21], a method based on spectral properties of the graph that achieves weak recovery.

We start with an independently generated random graph built on the same nodes, where each candidate edge appears with a fixed probability; its complement is also defined. The observed graph is then partitioned into two edge-disjoint subgraphs: one used for the weak recovery step, the other for local modification. This partitioning allows the two steps to remain independent.

We perform the weak recovery algorithm of [21] on the first subgraph. Since this subgraph is itself a stochastic block model with scaled connectivity parameters, the weak recovery algorithm is guaranteed to return two communities that agree with the true communities on all but a vanishing fraction of nodes (i.e., weak recovery); a sufficient condition for this is given in [21].

The community assignments are locally modified as follows: for a node in the first community, flip its membership if the number of its edges toward the second community is greater than or equal to the number of its edges toward the first community plus a threshold. For a node in the second community, flip its membership if the number of its edges toward the first community is greater than or equal to the number of its edges toward the second community minus the threshold. If the numbers of flips in the two clusters are not the same, keep the clusters unchanged. The detailed algorithm is shown in Table I.

Algorithm 1
1: Start with graph and side information
2: Generate an Erdös-Renyi graph with edge probability . Use it to partition into and .
3: Apply weak recovery algorithm [21] on , calling the resulting communities .
4: Initialize and .
5: For each node modify and as follows:
 Flip membership if and
 Flip membership if and
6: Check size of communities. If or equivalently , discard changes via and .
TABLE I: Algorithm for exact recovery.
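The local modification step (step 5 of the algorithm) can be sketched as follows. The exact threshold derived from the side-information log-likelihood ratio is given by the paper's equations, so it is left here as a free parameter `shift`; all names are conventions of this sketch.

```python
def local_modification(x_hat, edges, y, shift):
    """One pass of the local improvement step (a sketch).
    For each node, compare its edge counts into the two current
    communities; the noisy label y[i] biases the comparison by
    `shift` toward the observed label. If the two communities end
    up with unequal flip counts, all changes are discarded so the
    clusters stay balanced."""
    n = len(x_hat)
    neighbors = [[] for _ in range(n)]
    for (i, j) in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    new_x = list(x_hat)
    flips_plus = flips_minus = 0
    for i in range(n):
        same = sum(1 for j in neighbors[i] if x_hat[j] == x_hat[i])
        other = len(neighbors[i]) - same
        bias = shift if y[i] != x_hat[i] else -shift
        if other + bias >= same:          # flipping looks at least as likely
            new_x[i] = -x_hat[i]
            if x_hat[i] == 1:
                flips_plus += 1
            else:
                flips_minus += 1
    if flips_plus != flips_minus:
        return list(x_hat)                # step 6: keep clusters unchanged
    return new_x
```

All flip decisions are made against the initial assignment `x_hat` (a parallel update), matching the algorithm's single pass over the nodes.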
Theorem 2

With probability approaching one as grows, the algorithm above successfully recovers the communities if:

We first upper bound the misclassification probability of a node assuming the modification subgraph is complete, then adjust the bound to account for its departure from a complete graph.

Fig. 4: Two types of error events for the two-stage algorithm. The node in the top half of the figure is misclassified in weak recovery, and remains uncorrected by local modification. The node in the bottom half is correctly classified in weak recovery, but is mistakenly flipped by local modification.

Fig. 4 shows the misclassification conditions: an error happens either when weak recovery was correct and is overturned by the local modification, or when weak recovery was incorrect and is not corrected by local modification. Let two families of indicator variables represent edges inside a community and across communities, occurring with probabilities p and q, respectively. Then, the misclassification probability is:

(7)

To adjust for the fact that the modification subgraph is not complete, the following lemma is used.

Lemma 6

With high probability, the degree of any node in is at most .

Let the degree of a node be written as a sum of i.i.d. Bernoulli random variables with the appropriate edge parameter. Then, by the Chernoff bound:

(8)

Thus, by using a union bound:

Having bounded the relevant node degrees, the correct error probability (for the incomplete subgraph) can be arrived at by removing a bounded number of terms from the summations on the right-hand side of (7). If we remove exactly that many terms, the following upper bound on the error probability is obtained:

(9)

The following lemma shows an upper bound on .

Lemma 7

See Appendix B.

A simple union bound yields:

(10)

For the last case, the condition remains sufficient because of the following lemma.

Lemma 8

.

Let . Then, from the definition of :

(11)

Since is convex in , it can be shown that at the optimal , . Using this fact and substituting in (11):

(12)

By the definition of : . Using the fact that leads to , which implies that . Hence, by substituting in (12):

(13)

Also, it can be shown that at , . This implies that . Substituting in (13) leads to: , which implies that when .

Combining the last lemma with (10) concludes the proof.

IV Partially Revealed Labels

In this section, we consider side information consisting of partially revealed labels, where ϵ is the proportion of labels that remains unknown despite the side information. Tight necessary and sufficient conditions are presented for exact recovery under this type of side information. As with the noisy label side information, we begin by expressing the log-likelihood function. For a given side information vector, any label vector that contradicts the side information has zero posterior probability.⁴ We say a label contradicts the side information if the side information is not an erasure and it disagrees with the label. All label vectors that do not contradict the side information and satisfy the balanced prior have the same conditional probability. Thus, for all assignments that have non-zero conditional probability, the log-likelihood function can be written as:

(14)

where (a) holds because the side-information observations are independent given the labels. In (b), all terms that are independent of the community assignment have been collected into a constant, and the edge log-likelihood ratio has been approximated, which is made possible because both p and q approach zero as n grows.

The following lemma shows that the maximum likelihood detector will fail if the graph includes at least one pair of nodes, one in each community, that have more connections to opposite-labeled nodes than to similarly labeled nodes and whose side information is erased.

Lemma 9

Define the following events:

Then, .

From the true communities, we swap a pair of nodes, producing two new communities. We intend to show that, subject to observing the graph and the side information, the likelihood of the swapped assignment is larger than the likelihood of the true one; therefore, under the stated condition, maximum likelihood will fail.

Let a random variable represent the existence of the edge between two given nodes. Then, from (14):

(15)

where the first step holds by the assumption that the event in question occurred, and the second holds by the definitions above. The inequality implies the failure of maximum likelihood.

IV-A Necessary Conditions

Theorem 3

The maximum likelihood failure probability is bounded away from zero if:

  • and

  • , , and

Let be a subset of with . Consider the following modification to Definition 1:

It is not difficult to show that Lemmas 2, 3, and 4 remain valid under this modification. To complete the proof, it is sufficient to find conditions under which the required asymptotic (in n) behavior holds for all nodes.

Lemma 10

For sufficiently large , for , if one of the following is satisfied:

See Appendix C. Combining Lemma 10 with the modified forms of Lemmas 2, 3, and 4 concludes the proof of the theorem.

IV-B Sufficient Conditions

This section derives sufficient conditions for exact recovery by introducing an algorithm whose exact recovery conditions are identical to those of Section IV-A. The first stage of the algorithm is the same as in Section III-B. The second stage, involving local modification, is new and is described below.

The community assignments are locally modified for each node as follows: (a) if the node's membership contradicts its side information, flip the node's membership; or (b) if the node's side information is an erasure, re-assign its membership to the community to which it is connected with more edges. After going through all nodes, if the numbers of flips in the two communities are not the same, void all local modifications.
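This erasure-based local modification rule can be sketched as follows; the erasure symbol `0` and all names are conventions of the sketch, not the paper's notation.

```python
def local_modification_erasures(x_hat, edges, y):
    """Local step for partially revealed labels (a sketch).
    (a) A node whose current membership contradicts a revealed
        label is flipped to that label.
    (b) A node with an erased label (0 here) is reassigned to the
        community it has more edges into.
    If the flip counts in the two communities differ, all local
    changes are voided."""
    n = len(x_hat)
    neighbors = [[] for _ in range(n)]
    for (i, j) in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    new_x = list(x_hat)
    flips_plus = flips_minus = 0
    for i in range(n):
        if y[i] != 0:
            new_x[i] = y[i]                      # rule (a)
        else:
            same = sum(1 for j in neighbors[i] if x_hat[j] == x_hat[i])
            other = len(neighbors[i]) - same
            if other > same:                     # rule (b)
                new_x[i] = -x_hat[i]
        if new_x[i] != x_hat[i]:
            if x_hat[i] == 1:
                flips_plus += 1
            else:
                flips_minus += 1
    if flips_plus != flips_minus:
        return list(x_hat)                       # void all modifications
    return new_x
```

Revealed labels override the graph evidence outright, which is what makes this rule simpler than its noisy-label counterpart.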

Theorem 4

The algorithm described above successfully recovers the communities with high probability if:

Let . Following the same analysis as in the proof of Theorem 2:

(16)

Using Lemma 7 and strengthening the bound accordingly, equation (16) can be upper bounded as follows:

(17)

Thus, according to the asymptotic behavior of this bound:

A simple union bound yields:

V More General Side Information

We now generalize the side information random variables such that each node observes K features (side information), each with arbitrary fixed and finite cardinality; the alphabet of each feature and the conditional distribution of each feature given the node label are defined accordingly. All features are assumed to be independent conditioned on the labels. We first consider the case where K is fixed while the conditional distributions of the features vary with n for