Detecting communities (or clusters) in graphs is a fundamental problem that has been studied in various fields, including statistics [3, 4, 5, 6, 7], computer science [8, 9, 10, 11, 12], and theoretical statistical physics [13, 14]. It has many applications, such as finding like-minded people in social networks, improving recommendation systems, and detecting protein complexes. In this paper, we consider the problem of finding a single sub-graph (community) hidden in a large graph, where the community size is much smaller than the graph size. Applications of finding a hidden community include fraud activity detection [18, 19] and correlation mining.
Several models have been studied for random graphs that exhibit a community structure. A widely used model in the context of community detection is the stochastic block model (SBM). In this paper, the stochastic block model for one community is considered [23, 24, 25, 26]. The stochastic block model for one community consists of a graph of size with a community of size , where any two nodes are connected with probability if they are both within the community, and with probability otherwise.
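The generative model described above can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the names `n`, `k`, `p`, and `q` (graph size, community size, and the two connection probabilities) are assumed here for illustration, and the community is assumed to be chosen uniformly at random.

```python
import random

def sample_single_community_graph(n, k, p, q, seed=0):
    """Sample a graph on n nodes with one hidden community of size k.

    Pairs inside the community are connected with probability p,
    all other pairs with probability q (names are illustrative).
    """
    rng = random.Random(seed)
    community = set(rng.sample(range(n), k))  # hidden community, uniform choice
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            both_inside = i in community and j in community
            if rng.random() < (p if both_inside else q):
                edges.add((i, j))  # store each undirected edge once, with i < j
    return community, edges

community, edges = sample_single_community_graph(n=200, k=20, p=0.5, q=0.05)
```

Only the graph is produced here; side information would be generated separately for each node, conditioned on its label.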
The problem of finding a hidden community upon observing only the graph has been studied in [23, 24, 25]. The information limits of weak recovery and exact recovery have been studied in . (The extremal phase transition threshold is also known as the information-theoretic limit or information limit; we use the latter term throughout this paper.) Weak recovery is achieved when the expected number of misclassified nodes is , and exact recovery when all labels are recovered with probability approaching one. The limits of belief propagation for weak recovery have been characterized [25, 23] in terms of a signal-to-noise ratio parameter . The utility of a voting procedure after belief propagation to achieve exact recovery was pointed out in .
Graphical models are popular because they represent many large data sets and give insight into the performance of inference algorithms, but in many inference problems they do not capture all data that is both relevant and available. In many practical applications, relevant non-graphical information is available that can aid the inference. For example, social networks such as Facebook and Twitter have access to information other than the graph edges, such as date of birth, nationality, and school. A citation network has the authors' names and keywords, and may therefore provide significant additional information beyond the co-authoring relationships. This paper characterizes the utility of side information in single-community detection, in particular exploring when and by how much side information can improve the information limit, as well as the phase transition of belief propagation.
We model a varying quantity and quality of side information by associating with each node a vector (i.e., non-graphical) observation whose dimension represents the quantity of side information and whose (element-wise) log-likelihood ratios (LLRs) with respect to node labels represent the quality of side information. The contributions of this paper can be summarized as follows:
The information limits in the presence of side information are characterized. When the dimension of side information for each node varies but its LLR is fixed across , tight necessary and sufficient conditions are calculated for both weak and exact recovery. Also, it is shown that under the same sufficient conditions, weak recovery is achievable even when the size of the community is random and unknown. We also find conditions on the graph and side information where achievability of weak recovery implies achievability of exact recovery. Subject to some mild conditions on the exponential moments of the LLR, the results apply to both discrete and continuous-valued side information.
When the side information for each node has fixed dimension but varying LLR, we find tight necessary and sufficient conditions for exact recovery, and necessary conditions for weak recovery. Under varying LLR, our results apply to side information with finite alphabet.
The phase transition of belief propagation in the presence of side information is characterized, where we assume the side information per node has a fixed dimension. When the LLRs are fixed across , tight necessary and sufficient conditions are calculated for weak recovery. Furthermore, it is shown that when belief propagation fails, no local algorithm can achieve weak recovery. It is also shown that belief propagation is strictly inferior to the maximum likelihood detector. Numerical results on finite synthetic datasets validate our asymptotic analysis and show the relevance of our asymptotic results even to graphs of moderate size. We also calculate conditions under which belief propagation followed by a local voting procedure achieves exact recovery.
When the side information has variable LLR across , the belief propagation misclassification rate was calculated using density evolution. Our results generalize , where it was shown that belief propagation achieves weak recovery for only for binary side information consisting of noisy labels with vanishing noise.
We now present a brief review of the literature in the area of side information for community detection and highlight the distinctions of the present work. In the context of detecting two or more communities: Mossel and Xu showed that, under certain conditions, belief propagation with noisy label information has the same residual error as the maximum a-posteriori estimator for two symmetric communities. Cai et al. studied weak recovery of two symmetric communities under belief propagation upon observing a vanishing fraction of labels. Neither  nor  establishes a converse. For two symmetric communities, Saad and Nosratinia [29, 30] studied exact recovery under side information. Asadi studied the effect of i.i.d. vectors of side information on the phase transition of exact recovery for more than two communities. Kanade et al. showed that observation of a vanishing number of labels is unhelpful to the correlated recovery phase transition. (Correlated recovery denotes a probability of error that is strictly better than a random guess, and is not a subject of this paper.) For single community detection, Kadavankandy et al. studied belief propagation with noisy label information with vanishing noise (unbounded LLRs).
The issue of side information in the context of single-community detection has not been addressed in the literature except for  whose results are generalized in this paper. Analyzing the effect of side information on the information limit of weak recovery is a novel contribution of this work. A converse for local algorithms such as belief propagation with side information has not been available prior to this work. The study of side information whose LLRs vary with is largely novel. Finally, while this work (inevitably) shares many tools and techniques with other works in the area of stochastic block models and community detection, the treatment of side information with variable LLR (as a function of ) presents new challenges for the bounding of errors by the application of the Chernoff bound and large deviations, which are addressed in this work.
II System Model and Definitions
Let be a realization from a random ensemble of graphs , where each graph has nodes and contains a hidden community with size . The underlying distribution of the graph is as follows: an edge connects a pair of nodes with probability if both nodes are in and with probability otherwise. is the indicator of an edge between nodes . For each node , a vector of dimension is observed consisting of side information, whose distribution depends on the label of the node. By convention if and if . For node , the entries of the side information vector are each denoted and can be interpreted as different features of the side information. The side information for the entire graph is collected into the matrix . The column vector collects the side information feature for all nodes .
The vector of true labels is denoted . and are Bernoulli distributions with parameters , respectively, and is the log-likelihood ratio of edge with respect to and .
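Since the two edge distributions are Bernoulli, the edge LLR takes only two values, one for a present edge and one for an absent edge. The sketch below makes this concrete; the parameter names `p` and `q` (edge probability under the within-community and the other distribution, respectively) are assumed for illustration.

```python
import math

def edge_llr(edge_present, p, q):
    """LLR of one edge indicator between Bernoulli(p) and Bernoulli(q).

    p and q are illustrative names; both must lie strictly in (0, 1).
    """
    if edge_present:
        return math.log(p / q)
    return math.log((1 - p) / (1 - q))

# With p > q, a present edge is evidence for the within-community
# distribution and an absent edge is (weaker) evidence against it.
present = edge_llr(True, 0.5, 0.25)
absent = edge_llr(False, 0.5, 0.25)
```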
In this paper, we address the problem of single-community detection, i.e., recovering from and , under the following conditions: while , , and .
An estimator is said to achieve exact recovery of if, as , . An estimator is said to achieve weak recovery if, as , in probability, where denotes the Hamming distance. It was shown in  that the latter definition is equivalent to the existence of an estimator such that . This equivalence will be used throughout our paper.
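The two recovery criteria can be checked on a finite instance as follows. This is a minimal sketch under stated assumptions: labels are 0/1 lists, and the weak-recovery criterion is illustrated by the Hamming distance normalized by the community size, a normalization assumed here because the exact expression is not reproduced above.

```python
def hamming_distance(labels_a, labels_b):
    """Number of positions where two equal-length label vectors differ."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    return sum(a != b for a, b in zip(labels_a, labels_b))

def is_exactly_recovered(estimate, truth):
    """Exact recovery on this instance: every label is correct."""
    return hamming_distance(estimate, truth) == 0

def misclassification_ratio(estimate, truth):
    """Hamming distance divided by the community size (assumed normalization)."""
    return hamming_distance(estimate, truth) / sum(truth)

truth = [1, 1, 0, 0, 0]
estimate = [1, 0, 1, 0, 0]
ratio = misclassification_ratio(estimate, truth)
```

Both criteria are asymptotic statements, so a single finite instance can only illustrate the quantities involved, not verify recovery.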
III Information Limits
III-A Fixed-Quality Features
In this subsection, the side information for each node is allowed to evolve with by having a varying number of independent and identically distributed scalar observations, each of which has a finite (imperfect) amount of information about the node label. By allowing the dimension of the side information per-node to vary and its scalar components to be identically distributed, the side information is represented with fixed-quality quanta. The results of this section demonstrate that as grows, the number of these side information quanta per-node must increase in a prescribed fashion in order to have a positive effect on the threshold for recovery.
For all , for all , define the distributions:
Thus the components of the side information for each node (features) are identically distributed for all nodes and all graph sizes ; we also assume all features are independent conditioned on the node labels . The dimension of the side information per node is allowed to vary as the size of the graph changes.
In addition, we assume are such that the resulting LLR random variable, defined below, has bounded support:
Throughout the paper, will continue to denote the LLR random variable of one side information feature, and denotes the random variable of the LLR of a graph edge.
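For a discrete feature, the LLR of one side information feature is a function of the observed outcome. The following sketch computes it for a pair of probability mass functions; the dictionary representation and the noisy-label example with flip probability `alpha` are assumptions made for illustration.

```python
import math

def feature_llr(pmf_inside, pmf_outside):
    """Map each outcome x to log(pmf_inside[x] / pmf_outside[x]).

    Assumes both pmfs are strictly positive on the same finite alphabet,
    which keeps every LLR value finite (bounded support).
    """
    return {x: math.log(pmf_inside[x] / pmf_outside[x]) for x in pmf_inside}

# Illustrative noisy-label feature: the label is flipped with probability alpha.
alpha = 0.2
llr = feature_llr({1: 1 - alpha, 0: alpha}, {1: alpha, 0: 1 - alpha})
max_abs_llr = max(abs(value) for value in llr.values())
```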
where , and .
III-A1 Weak Recovery
Theorem 1.
For single community detection under bounded-LLR side information, weak recovery is achieved if and only if:
Theorem 1 shows that if grows with slowly enough, e.g., if is fixed and independent of , or if , side information does not affect the information limits.
If the features are conditionally independent but not identically distributed, it is easy to show the necessary and sufficient conditions are:
where and are analogous to and earlier, except specialized to each feature.
The assumption that the size of the community is known a-priori is not always reasonable: we might need to detect a small community whose size is not known in advance. In that case, the performance is characterized by the following lemma.
Please see Appendix D. ∎
III-A2 Exact Recovery
The sufficient conditions for exact recovery are derived using a two-step algorithm (see Table I). Its first step consists of any algorithm achieving weak recovery, e.g. maximum likelihood (see Lemma 1). The second step applies a local voting procedure.
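The second step can be pictured as a re-scoring of every node against the first-step estimate. The sketch below is an assumption-laden illustration, not the procedure of Table I (which is not reproduced here): each node is scored by the sum of edge LLRs toward the estimated community plus its own side information LLR, and the `k` highest scores are kept. Any data splitting needed to keep the two steps independent is omitted.

```python
import math

def local_voting(n, k, edges, first_stage, side_llr, p, q):
    """Re-estimate the community from a first-stage estimate (illustrative).

    edges: set of (i, j) pairs with i < j; first_stage: set of nodes;
    side_llr: per-node side information LLR; p, q: assumed edge probabilities.
    """
    llr_edge = math.log(p / q)
    llr_no_edge = math.log((1 - p) / (1 - q))
    scores = []
    for i in range(n):
        score = side_llr[i]
        for j in first_stage:
            if j == i:
                continue
            pair = (min(i, j), max(i, j))
            score += llr_edge if pair in edges else llr_no_edge
        scores.append(score)
    ranked = sorted(range(n), key=lambda node: scores[node], reverse=True)
    return set(ranked[:k])

# Toy instance: true community {0, 1, 2}; the first stage wrongly holds node 3.
toy_edges = {(0, 1), (0, 2), (1, 2)}
cleaned = local_voting(6, 3, toy_edges, {0, 1, 3}, [0.0] * 6, p=0.9, q=0.1)
```

In this toy instance the vote replaces the misclassified node 3 with node 2, which is the kind of clean-up the second step is meant to provide.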
Define and assume achieves weak recovery, i.e.
Please see Appendix E. ∎
Then the main result of this section follows:
Theorem 2.
In single community detection under bounded-LLR side information, assume (5) holds. Then exact recovery is achieved if and only if:
The assumption that (5) holds is necessary because otherwise weak recovery is not achievable and, by extension, neither is exact recovery.
Theorem 2 shows that if grows with slowly enough, e.g., is fixed and independent of or , side information will not affect the information limits of exact recovery.
To illustrate the effect of side information on information limits, consider the following example:
for positive constants . Then, , and hence, weak recovery is achieved without side information, and by extension, with side information. Moreover, exact recovery without side information is achieved if and only if:
Assume noisy label side information with error probability . By Theorem 2, exact recovery is achieved if and only if:
If , then (13) reduces to (12), thus side information does not improve the information limits of exact recovery. If , then since . It follows that (13) is less restrictive than (12), thus improving the information limit.
III-B Variable-Quality Features
In this section, the number of features, , is assumed to be constant but the LLR of each feature is allowed to vary with .
III-B1 Weak Recovery
Recall that the probability distribution of side information feature is when the node is inside the community, and when the node is outside the community.
Theorem 3 (Necessary Conditions for Weak Recovery).
For single community detection under bounded-LLR side information, weak recovery is achieved only if:
The proof is similar to that of Theorem 1. ∎
III-B2 Exact Recovery
We begin by concentrating on the following regime, and will subsequently show its relation to the set of problems that are both feasible and interesting.
with constants and .
The alphabet for each feature is denoted with , where is the cardinality of feature which, in this section, is assumed to be bounded and constant across . The likelihoods of the features are defined as follows:
Recall that in our side information model, all features are independent conditioned on the labels. To ensure that the quality of the side information is increasing with , both and are assumed to be either constant or monotonic in .
To better understand the behavior of information limits, we categorize side information outcomes based on the trends of LLR and likelihoods. For simplicity we speak of trends for one feature; extension to multiple features is straightforward. An outcome is called informative if and non-informative if . An outcome is called rare if and not rare if . Among the four different combinations, the worst case is when the outcome is both non-informative and not rare for nodes inside and outside the community. We will show that if such an outcome exists, then side information will not improve the information limit. The best case is when the outcome is informative and rare for the nodes inside the community, or for the nodes outside the community, but not both. Two cases are in between: (1) an outcome that is non-informative and rare for nodes inside and outside the community and (2) an outcome that is informative and not rare for nodes inside and outside the community. It will be shown that the last three cases can affect the information limit under certain conditions.
For convenience we define:
We introduce the following functions whose value, as shown in the sequel, characterizes the exact recovery threshold:
The LLR of each feature is denoted:
We also define the following functions of the likelihood and LLR of side information, whose evolution with is critical to the phase transition of exact recovery.
In the following, the side information outcomes are represented by their index without loss of generality. Throughout, dependence on of outcomes and their likelihood is implicit.
Theorem 4.
In the regime characterized by (16), assume is constant and and are either constant or monotonic in . Then, necessary and sufficient conditions for exact recovery depend on side information statistics in the following manner:
If there exists any sequence (over ) of side information outcomes such that , , are all , then must hold.
If there exists any sequence (over ) of side information outcomes such that and evolve according to with , then must hold.
If there exists any sequence (over ) of side information outcomes such that with and furthermore , then must hold.
If there exists any sequence (over ) of side information outcomes such that with and furthermore , then must hold.
If there exists any sequence (over ) of side information outcomes such that with and furthermore , then must hold.
Theorem 4 does not address because it leads to a trivial problem. For example, for noisy label side information, if the noise parameter , then side information alone is sufficient for exact recovery. Also, when with , a necessary condition is easily obtained but a matching sufficient condition for this case remains unavailable.
In the following, we specialize the results of Theorem 4 to noisy-labels and partially-revealed-label side information.
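The two special cases of side information can be simulated as simple channels acting on the true labels. The following is a minimal sketch; the parameter names `alpha` (flip probability) and `erase_prob` (probability that a label is not revealed) are assumed for illustration, and an unrevealed label is represented by `None`.

```python
import random

def noisy_labels(labels, alpha, rng):
    """Flip each 0/1 label independently with probability alpha."""
    return [1 - x if rng.random() < alpha else x for x in labels]

def partially_revealed_labels(labels, erase_prob, rng):
    """Hide each label independently with probability erase_prob (None = hidden)."""
    return [None if rng.random() < erase_prob else x for x in labels]

rng = random.Random(0)
true_labels = [1, 0, 0, 1, 0, 0, 0, 1]
noisy = noisy_labels(true_labels, alpha=0.1, rng=rng)
revealed = partially_revealed_labels(true_labels, erase_prob=0.5, rng=rng)
```

The two channels differ qualitatively: a noisy label is always observed but may be wrong, while a revealed label is always correct but may be missing.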
Figure 2 shows the error exponent for the noisy label side information as a function of .
For side information consisting of a fraction of the labels revealed, Theorem 4 states that exact recovery is achieved if and only if:
Figure 3 shows the error exponent for partially revealed labels, as a function of .
We now comment on the coverage of the regime (16). If the average degree of a node is , then the graph will have isolated nodes and exact recovery is impossible. If the average degree of the node is , then the problem is trivial. Therefore the regime of interest is when the average degree is . This restricts and in a manner that is reflected in (16). Beyond that, in the system model of this paper , so is either or approaching a constant . The regime (16) focuses on the former, but the proofs are easily modified to cover the latter. For the convenience of the reader, we highlight the places in the proof where a modification is necessary to cover the latter case.
IV Belief Propagation
Belief propagation for recovering a single community was studied without side information in [25, 23] in terms of a signal-to-noise ratio parameter , showing that weak recovery is achieved if and only if . Moreover, belief propagation followed by a local voting procedure was shown to achieve exact recovery if , as long as information limits allow exact recovery.
In this section , i.e., we consider scalar side information random variables that are discrete and take values from an alphabet of size . Extension to vector side information is straightforward as long as the dimensionality is constant across ; the extension is outlined in Corollary 3.
Denote the expectation of the likelihood ratio of the side information conditioned on by:
By definition, , where is the chi-squared divergence between the conditional distributions of side information. Thus, .
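The relation between the expected likelihood ratio and the chi-squared divergence can be checked numerically for any pair of discrete distributions. The sketch below assumes the expectation is taken under the inside-community distribution (the case in which the identity holds) and uses dictionaries for the two probability mass functions; both choices are illustrative.

```python
def expected_likelihood_ratio(pmf_inside, pmf_outside):
    """E[V(X)/U(X)] with X drawn from V, i.e. sum over x of V(x)^2 / U(x)."""
    return sum(pmf_inside[x] ** 2 / pmf_outside[x] for x in pmf_inside)

def chi_squared_divergence(pmf_inside, pmf_outside):
    """Chi-squared divergence: sum over x of (V(x) - U(x))^2 / U(x)."""
    return sum((pmf_inside[x] - pmf_outside[x]) ** 2 / pmf_outside[x]
               for x in pmf_inside)

v = {0: 0.2, 1: 0.8}
u = {0: 0.8, 1: 0.2}
ratio_mean = expected_likelihood_ratio(v, u)
divergence = chi_squared_divergence(v, u)
```

Because the divergence is non-negative, the expected likelihood ratio is at least one, with equality only when the two distributions coincide.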
IV-A Bounded LLR
We begin by demonstrating the performance of the belief propagation algorithm on a random tree with side information. Then, we show that the same performance is possible on a random graph drawn from , using a coupling lemma  expressing local approximation of random graphs by trees.
IV-A1 Belief Propagation on a Random Tree with Side Information
We model random trees with side information in a manner roughly parallel to random graphs. Let be an infinite tree with nodes , each of them possessing a label . The root is node . The subtree of depth rooted at node is denoted . For brevity, the subtree rooted at with depth is denoted . Unlike the random graph counterpart, the tree and its node labels are generated together as follows: is a Bernoulli- random variable. For any , the number of its children with label is a random variable that is Poisson with parameter if , and Poisson with parameter if . The number of children of node with label is a random variable which is Poisson with parameter , regardless of the label of node . The side information takes value in a finite alphabet . The set of all labels in is denoted with , all side information with , and the labels and side information of with and respectively. The likelihood of side information continues to be denoted by , as earlier.
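The tree construction above can be simulated level by level. The following is a minimal sketch in which all parameter names are hypothetical: `root_prior` is the probability that the root is in the community, `mean_in_from_in` and `mean_in_from_out` are the Poisson means for community children of a community or non-community parent, and `mean_out` is the Poisson mean for non-community children. Side information is not generated here.

```python
import math
import random

def poisson(rng, mean):
    """Sample a Poisson variate with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_tree(depth, root_prior, mean_in_from_in, mean_in_from_out, mean_out, rng):
    """Return a list of (label, parent_index, level) tuples; the root is index 0."""
    nodes = [(1 if rng.random() < root_prior else 0, None, 0)]
    frontier = [0]
    for level in range(1, depth + 1):
        next_frontier = []
        for parent in frontier:
            parent_label = nodes[parent][0]
            mean_in = mean_in_from_in if parent_label == 1 else mean_in_from_out
            for label, mean in ((1, mean_in), (0, mean_out)):
                for _ in range(poisson(rng, mean)):
                    nodes.append((label, parent, level))
                    next_frontier.append(len(nodes) - 1)
        frontier = next_frontier
    return nodes

tree = sample_tree(depth=3, root_prior=0.1, mean_in_from_in=2.0,
                   mean_in_from_out=0.5, mean_out=1.5, rng=random.Random(7))
```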
The problem of interest is to infer the label given observations and . The error probability of an estimator can be written as:
The maximum a posteriori (MAP) detector minimizes and can be written in terms of the log-likelihood ratio as , where and:
The probability of error of the MAP estimator can be bounded as follows :
Let denote the children of node , and . Then,
See Appendix L. ∎
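The MAP rule described earlier in this subsection reduces to comparing a log-likelihood ratio with a threshold set by the prior. The sketch below shows the generic binary-hypothesis form of that rule; the name `prior_one` (prior probability that the label is 1) is assumed for illustration, and the threshold is the standard log-odds of the prior rather than an expression taken from the text.

```python
import math

def map_decision(llr, prior_one):
    """Decide label 1 when the LLR reaches log((1 - prior_one) / prior_one)."""
    threshold = math.log((1 - prior_one) / prior_one)
    return 1 if llr >= threshold else 0

# A small community means a small prior, hence a high threshold:
# more evidence is needed before a node is declared a member.
decision_weak_evidence = map_decision(2.0, prior_one=0.1)
decision_strong_evidence = map_decision(3.0, prior_one=0.1)
```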
Lower and Upper Bounds on
Define for and any node :
Then, and . Let and denote random variables drawn according to the distribution of conditioned on and , respectively. Similarly, let and denote random variables drawn according to the distribution of conditioned on and , respectively. Thus, . Define:
Let . Then:
See Appendix M. ∎
Thus to bound , lower and upper bounds on are needed.
For all , if , then .
See Appendix N. ∎
Define and . Assume that . Then,
See Appendix O. ∎
The sequences and are non-decreasing in .
The proof follows directly from [25, Lemma 5]. ∎
Define to be the number of times the logarithm function must be iteratively applied to to get a result less than or equal to one. Let and . Suppose . Then there are constants and depending only on and such that:
whenever and .
See Appendix P. ∎
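The iterated-logarithm count used in the preceding lemma can be computed directly from its definition. This is a minimal sketch that assumes the natural logarithm; a different base would change the count for some inputs.

```python
import math

def log_star(x):
    """Number of times log must be applied to x to reach a value <= 1."""
    count = 0
    while x > 1:
        x = math.log(x)
        count += 1
    return count

small = log_star(2)
larger = log_star(16)
```

The function grows extremely slowly, which is why bounds expressed in terms of it involve only a handful of iterations even for very large arguments.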
Achievability and Converse for the MAP Detector
Let , and . If , then:
If , then:
Moreover, since :
for some .
IV-A2 Belief Propagation Algorithm for Community Recovery with Side Information
In this section, the inference problem defined on the random tree is coupled to the problem of recovering a hidden community with side information. This can be done via a coupling lemma  that shows that under certain conditions, the neighborhood of a fixed node in the graph is locally a tree with probability converging to one, and hence, the belief propagation algorithm defined for random trees in Section IV-A1 can be used on the graph as well. The proof of the coupling lemma depends only on the tree structure, implying that it also holds for our system model, where the side information is independent of the tree structure given the labels.
Define to be the subgraph containing all nodes that are at a distance at most from node and define and to be the set of labels and side information of all nodes in , respectively.
Lemma 10 (Coupling Lemma ).
Suppose that are positive integers such that . Then:
If the size of community is deterministic and known, i.e., , then for any node in the graph, there exists a coupling between and such that:
where for convenience of notation, the dependence of on is made implicit.
If obeys a probability distribution so that with , then for any node , there exists a coupling between and such that:
Now, we are ready to present the belief propagation algorithm for community recovery with bounded side information. Define the message transmitted from node to its neighboring node at iteration as:
where , is the set of neighbors of node and . The messages are initialized to zero for all nodes , i.e., for all and . Define the belief of node at iteration as:
Algorithm II presents the proposed belief propagation algorithm for community recovery with side information.
Algorithm II: Belief Propagation Algorithm
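The overall structure of the message-passing procedure can be sketched as follows. This is a skeleton under stated assumptions, not the algorithm as specified: the per-neighbor term `contribution` and the constant `bias` are hypothetical placeholders for the update rule, the message from a node to a neighbor is assumed to exclude that neighbor's own message (the usual belief propagation convention), and the final step is assumed to keep the `k` nodes with the largest beliefs.

```python
def belief_propagation(neighbors, side_llr, bias, contribution, iterations, k):
    """Skeleton of message passing with side information (illustrative).

    neighbors: dict node -> set of neighbors (symmetric adjacency);
    side_llr: dict node -> side information LLR;
    contribution: hypothetical function applied to each incoming message.
    """
    # Messages are initialized to zero on every directed edge.
    messages = {(i, j): 0.0 for i in neighbors for j in neighbors[i]}
    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            incoming = sum(contribution(messages[(l, i)])
                           for l in neighbors[i] if l != j)
            updated[(i, j)] = side_llr[i] + bias + incoming
        messages = updated
    # The belief of a node aggregates the messages from all of its neighbors.
    beliefs = {i: side_llr[i] + bias
                  + sum(contribution(messages[(l, i)]) for l in neighbors[i])
               for i in neighbors}
    ranked = sorted(beliefs, key=beliefs.get, reverse=True)
    return set(ranked[:k]), beliefs

star = {0: {1, 2}, 1: {0}, 2: {0}}
estimate, beliefs = belief_propagation(
    star, {0: 1.0, 1: 0.0, 2: -1.0}, bias=0.0,
    contribution=lambda m: m, iterations=1, k=1)
```

Passing the update term as a function keeps the skeleton independent of the exact message formula, so the actual rule can be substituted without changing the control flow.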
If in Algorithm II we have , according to Lemma 10 with probability converging to one , where was the log-likelihood defined for the random tree. Hence, the performance of Algorithm II is expected to be the same as the MAP estimator defined as , where . The only difference is that the MAP estimator decides based on while Algorithm II selects the largest