Phase Transitions in Community Detection: A Solvable Toy Model

12/02/2013 ∙ by Greg Ver Steeg, et al. ∙ USC Information Sciences Institute

Recently, it was shown that there is a phase transition in the community detection problem. This transition was first computed using the cavity method, and has been proved rigorously in the case of q=2 groups. However, analytic calculations using the cavity method are challenging since they require us to understand probability distributions of messages. We study analogous transitions in the so-called "zero-temperature inference" model, where this distribution is supported only on the most-likely messages. Furthermore, whenever several messages are equally likely, we break the tie by choosing among them with equal probability. While the resulting analysis does not give the correct values of the thresholds, it does reproduce some of the qualitative features of the system. It predicts a first-order detectability transition whenever q > 2, while the finite-temperature cavity method shows that this is the case only when q > 4. It also has a regime analogous to the "hard but detectable" phase, where the community structure can be partially recovered, but only when the initial messages are sufficiently accurate. Finally, we study a semisupervised setting where we are given the correct labels for a fraction ρ of the nodes. For q > 2, we find a regime where the accuracy jumps discontinuously at a critical value of ρ.

I Introduction

A number of recent papers have studied fundamental limits on community detection in the stochastic block model (SBM), a simple generative model of networks with tunable modularity. For networks that are dense enough, with an average degree that grows faster than log N (where N is the number of nodes), the communities can be recovered exactly under some circumstances Bickel-Chen-on-modularity . However, in the sparse case where the average degree is O(1), there is a sharp transition below which the communities are undetectable Reichardt2008 ; Allahverdyan2010a ; Decelle2011PRL ; Decelle2011PRE ; Hu2012 ; Ronhovde2012 ; Nadakuditi2012 . The location of this transition was found using the cavity method Decelle2011PRL ; Decelle2011PRE , or equivalently, by analyzing the behavior of the belief propagation (BP) algorithm. It was also hypothesized that BP is an optimal inference method for community detection in SBM, so that the corresponding detectability threshold is algorithm-independent Decelle2011PRL ; Decelle2011PRE . This hypothesis was proved rigorously in the case of q = 2 groups mossel-neeman-sly ; massoulie ; mossel-neeman-sly-proof : in the detectable regime, a polynomial-time algorithm exists that labels nodes correctly with probability bounded above 1/2, while in the undetectable regime, graphs generated by the SBM are indistinguishable from Erdős-Rényi random graphs, and no algorithm can label the nodes better than chance.

Below the transition, for q = 2, belief propagation converges to a paramagnetic fixed point where every node is equally likely to belong to either community. For q > 4, however, cavity method calculations Decelle2011PRL ; Decelle2011PRE show that the situation is more complicated, including a “hard but detectable” regime where the communities can be recovered by belief propagation, but only if the algorithm is given a strong initial hint about the correct labels. However, addressing this claim analytically is difficult, given that the cavity method requires us to keep track of an entire probability distribution of messages.

Here we study the community detection problem within the zero-temperature Bethe-Peierls approximation Mezard1987 ; Mezard2001 ; Mezard2003 . Equivalently, we study a message-passing algorithm where the distribution of messages is concentrated on the most likely label of each node. Zero-temperature inference for community detection was also studied in Reichardt2008 . However, we augment this algorithm with a tiebreaking mechanism: whenever a node has several equally-likely choices for its label, we break the symmetry randomly and uniformly among these choices, in effect applying an infinitesimal random external field. This reduces the number of order parameters considerably, making it possible to study the model analytically for any value of q.

We emphasize that the zero-temperature randomized message passing method should be thought of as a “toy model” for the real community detection problem. In particular, it overestimates the detectability thresholds, and the corresponding algorithm is far from optimal, as we discuss below. Nevertheless, it reproduces some of the qualitative features of the real system and allows insight into the interesting but analytically difficult regime in which the graph contains many communities.

Our Contributions.

We derive a fixed point equation for the proposed message-passing algorithm, and use it to analytically explore the community detection problem. In this model, we find the detectability transition is continuous for q = 2, and discontinuous for q > 2; in contrast, the finite-temperature cavity method Decelle2011PRL ; Decelle2011PRE shows that the transition is continuous for q ≤ 4 and discontinuous for q > 4 (in the assortative case). Analogous to the “hard but detectable” regime Decelle2011PRL ; Decelle2011PRE , we also find a regime in which there are two fixed points: a paramagnetic one where all labels are equally likely, and a second one which has high accuracy. In this regime, the algorithm is able to recover the underlying community structure only if the initial messages are sufficiently close to the true labels; otherwise, it converges to the paramagnetic solution.

We also analyze the SBM reconstruction problem in “semisupervised” settings where one is given the true group labels for a fraction ρ of nodes. For q = 2, even a tiny amount of prior information suppresses the detectability transition in the zero-temperature model Allahverdyan2010a . Here we show that for q > 2, the behavior of the inference problem is much richer. Namely, while the prior information always improves the accuracy, there is a line of discontinuities where the accuracy jumps discontinuously at a critical value of ρ, again in qualitative agreement with the cavity method for sufficiently large q zhang-moore-zdeborova .

II Stochastic Block Models

Consider a network of N nodes, where each node belongs to one of q groups or communities. Each node i has a community label, and together these labels form the community assignment of the network. The probability of a link between two nodes is given by a matrix indexed by their groups. We focus on the case where this probability depends only on whether the nodes are in the same or different groups: that is, it takes one value, p_in, within groups and another, p_out, between groups. We assume the network is assortative, i.e., that p_in > p_out. Finally, we assume it is sparse, i.e., that p_in and p_out are O(1/N).
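As an illustration of the generative model just described, here is a minimal Python sketch that samples labels and edges from a sparse block model. The function name `sample_sbm` and the parameters `c_in` and `c_out` (expected number of neighbors in a node's own group and in each other group) are assumptions made for this example, as is the conversion to link probabilities p_in = q * c_in / n and p_out = q * c_out / n, which presumes groups of roughly equal size.

```python
import random

def sample_sbm(n, q, c_in, c_out, seed=0):
    """Sample labels and an edge list from a sparse block model.

    Assumption for this sketch: groups are roughly equal in size, so a node
    with c_in expected neighbors in its own group has p_in = q * c_in / n.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(q) for _ in range(n)]
    p_in = min(1.0, q * c_in / n)
    p_out = min(1.0, q * c_out / n)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.append((i, j))
    return labels, edges

labels, edges = sample_sbm(n=300, q=3, c_in=6.0, c_out=1.0)
print(len(labels), len(edges))
```

The double loop costs O(n^2), which is acceptable for small experiments; a sparse sampler would be preferable for large n.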

Let A be the adjacency matrix of a graph generated by the above block model. The generative model is fully described by the following joint probability

(1)

where encodes prior information about the community assignment one might have.

Given the observed network, we are interested in reconstructing the unknown community assignment. Toward this goal, we define the posterior probability of the assignment given the network, which, by Bayes' theorem, can be written as follows:

(2)

If the prior is constant, this gives a Gibbs distribution at unit temperature.
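For orientation, one standard way to write the joint and posterior distributions for a block model of this kind is sketched below. The notation is an assumption made for this sketch and may differ from the original equations: s is the vector of labels s_i, A is the adjacency matrix, p_{ab} is the link probability between groups a and b, and P_0 is the prior.

```latex
P(\mathbf{s}, A) = P_0(\mathbf{s}) \prod_{i<j}
  p_{s_i s_j}^{A_{ij}} \left(1 - p_{s_i s_j}\right)^{1 - A_{ij}},
\qquad
P(\mathbf{s} \mid A) = \frac{P(\mathbf{s}, A)}{\sum_{\mathbf{s}'} P(\mathbf{s}', A)} .
```

Taking the negative logarithm of the posterior gives an energy function, which is the sense in which a constant prior yields a Gibbs distribution at unit temperature.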

There are several approaches for deciding the community assignments from the Gibbs distribution, and different approaches are optimal for different loss functions Iba1999 . For instance, the fraction of correctly inferred labels is maximized by computing the marginal probability of each node's label and choosing the most-likely label for each one. Here we focus on a different approach known as maximum a posteriori (MAP) estimation, which tries to find the state that jointly maximizes the posterior. This is the ground state of a generalized Potts Hamiltonian,

(3)

where the second term represents prior knowledge about the community assignments.

Since exact minimization is computationally intractable for large graphs, one has to resort to approximate methods. A popular family of such methods are message-passing algorithms such as belief propagation (BP). When the underlying graph is a tree, BP converges to the true marginals of the Gibbs distribution; although there are no convergence guarantees for general graphs with loops, the typical loop length in SBM scales as log N, so we expect BP to be asymptotically correct in the thermodynamic limit Decelle2011PRL ; Decelle2011PRE .

If we want to find the ground state rather than the marginals, however, it makes sense to consider a zero-temperature version of belief propagation, where the messages are concentrated on the most-likely labels. We describe this algorithm, and our simplification of it, in the next section.

III Zero-temperature Message Passing

In the zero-temperature form of belief propagation, also known as the max-product algorithm, the messages have a particularly simple form. Namely, they are binary vectors with q components, where each component is 0 or 1 and at least one component is positive. The message (also referred to as the cavity bias) from node i to node j describes the preferred state of node i in the absence of node j. To calculate this message, node i sums the messages from all its neighboring nodes except j, obtaining the cavity field. It then constructs a new message by applying a function that picks the maximum component of its argument and sets it to 1, while setting all the other components to zero. Thus a component of the message is 1 if the corresponding group is one of the most-likely groups for i to belong to, given the most-likely group memberships of its neighbors other than j.
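The update described above can be sketched in a few lines of Python. The function names `cavity_field` and `max_message` are hypothetical, and the sketch keeps every component that attains the maximum, which is how ties produce messages with more than one nonzero entry.

```python
def cavity_field(incoming, q):
    """Sum the binary messages from all neighbors except the recipient."""
    field = [0] * q
    for msg in incoming:
        for a in range(q):
            field[a] += msg[a]
    return field

def max_message(field):
    """Mark every component of the field that attains the maximum."""
    top = max(field)
    return [1 if h == top else 0 for h in field]

# Three incoming messages over q = 3 groups; groups 0 and 1 tie.
incoming = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
print(max_message(cavity_field(incoming, q=3)))  # [1, 1, 0]
```

A node with no other neighbors has a zero field, so its message marks all q groups as equally likely.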

There are 2^q - 1 possible messages. Furthermore, the probability of a particular message depends on the true label of the node it originates from, giving q(2^q - 1) probabilities. However, due to symmetry, one can show that there are only 2q - 1 relevant order parameters Reichardt2008 . Namely, what matters is (a) whether the component for the correct label is nonzero, and (b) the number of other non-zero entries of the message. Thus, the cavity field distributions can be parameterized by these two quantities alone.

The fixed point of the message passing procedure can be found by solving the so-called cavity equation. For the SBM, this equation seems to have a closed-form solution only in simple cases, such as q = 2 Reichardt2008 ; Allahverdyan2010a ; Hu2012 . In general, one has to resort to numerical methods such as population dynamics. Here one considers a pool of messages that are dynamically updated according to the rules specified above, while choosing the number of neighbors a node has in each group from the appropriate degree distribution. In essence, this simulates the message-passing algorithm within the annealed approximation, where the network is redrawn at each iteration.

Here we modify the message-passing scheme by only allowing messages where exactly one component is 1. In our update, if the procedure above gives a message with more than one nonzero component, we break the tie by choosing one of those components with equal probability. In that case, by symmetry and normalization, the only relevant order parameter is the fraction of messages that are correct, which we can think of as a magnetization.
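A minimal sketch of this tiebreaking rule follows. The function name `tiebroken_message` is hypothetical; the rule itself, a uniform choice among the components that attain the maximum field, is the one described in the paragraph above.

```python
import random

def tiebroken_message(field, rng):
    """Return a one-hot message, choosing uniformly among the maximal components."""
    top = max(field)
    winners = [a for a, h in enumerate(field) if h == top]
    choice = rng.choice(winners)
    return [1 if a == choice else 0 for a in range(len(field))]

rng = random.Random(0)
# Groups 0 and 1 tie, so the message is [1, 0, 0] or [0, 1, 0] with equal probability.
print(tiebroken_message([2, 2, 0], rng))
```

Because every message is now one-hot, the state of the message pool is summarized by how often the chosen component matches the sender's true label.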

IV Analysis for q = 2

Below we use and to denote the average number of neighbors a node has in its own group and in each other group respectively. The total connectivity, or average degree, is then . We write as a measure of the strength of the community structure.

We start by analyzing zero-temperature message passing in the case q = 2. There are three order parameters, corresponding to the densities of correct, incorrect, and non-informative messages respectively. The update rule is a majority vote, with a non-informative message corresponding to a tie. Specifically, count the correct messages node i receives from neighbors in its own group plus the incorrect messages it receives from the other group, and separately count the incorrect messages it receives from its own group plus the correct messages from the other group. Then i's message is correct, incorrect, or uninformative if the first count minus the second is positive, negative, or zero respectively.

In networks generated by the stochastic block model, the numbers of neighbors a node has in its own group and in the other group are Poisson-distributed, with means equal to the two average connectivities defined above (note that we can generalize this to other degree distributions). As in the cavity method, we assume that the messages sent by these neighbors are independent. Thus the two message counts defined above are also Poisson-distributed, with means

(4)

Their difference is then distributed according to the Skellam distribution:

(5)

where is the modified Bessel function of the first kind. Without tiebreaking, the fixed-point equations of population dynamics are thus

(6)
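The majority-vote update without tiebreaking can be evaluated numerically as a sanity check. The sketch below is one reading of the text, not the paper's own formulas: `a`, `b` are assumed names for the densities of correct and incorrect messages, `c_in` and `c_out` are assumed names for the average number of neighbors in a node's own group and in the other group, and the Skellam probabilities are obtained by summing products of Poisson probabilities rather than through Bessel functions.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of k events with mean mu."""
    if mu == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def vote_probabilities(mu_plus, mu_minus, kmax=200):
    """Return P(d > 0), P(d < 0), P(d = 0) for d = n_plus - n_minus.

    n_plus and n_minus are independent Poisson counts; sums are truncated at
    kmax, which assumes the means are small compared to kmax.
    """
    p_plus = [poisson_pmf(k, mu_plus) for k in range(kmax)]
    p_minus = [poisson_pmf(k, mu_minus) for k in range(kmax)]
    tie = sum(x * y for x, y in zip(p_plus, p_minus))
    win = 0.0
    below = 0.0  # running value of P(n_minus < k)
    for k in range(kmax):
        win += p_plus[k] * below
        below += p_minus[k]
    return win, 1.0 - win - tie, tie

def update(a, b, c_in, c_out):
    """One step of the q = 2 map for (correct, incorrect, uninformative)."""
    mu_plus = c_in * a + c_out * b   # votes for the true label
    mu_minus = c_in * b + c_out * a  # votes against it
    return vote_probabilities(mu_plus, mu_minus)

print(update(0.6, 0.3, c_in=4.0, c_out=1.0))
```

Uninformative messages contribute to neither count, which is why only `a` and `b` enter the two means.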

If we define

(7)

then and are the magnetization and the Edwards-Anderson parameter, respectively. The fixed point equations can be rewritten in terms of these parameters,

(8)
(9)

where we have defined

(10)
(11)

These same equations were obtained via zero-temperature cavity methods in Allahverdyan2010a , but this derivation is considerably simpler.

In the vicinity of the second-order phase transition we linearize (9) around to obtain

(12)

Using the identity , the sum in (12) telescopes, giving an equation for the detection threshold,

(13)

where is given by taking in (8),

(14)

We now consider the case with tiebreaking, flipping a coin whenever the vote is tied. In that case (6) becomes

(15)

and . After some manipulation, we obtain

(16)

but where now , are defined by

(17)

Reasoning as before, we obtain for the threshold

(18)

Comparing (13) and (18) we see that tiebreaking is equivalent to setting the Edwards-Anderson parameter to . In Fig. 1 we show both thresholds as a function of . The threshold with tiebreaking is higher, showing that it can be helpful to report ties rather than break them; this is reminiscent of distributed algorithms for approximate majority AngluinAE2008majority . In fact, the tiebreaking algorithm fails to find communities even when , i.e., where all links are within groups, if , since at this point . In contrast, without tiebreaking for all .
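The behavior of the tiebreaking variant can be explored by iterating its one-parameter map directly. The sketch below assumes the following reading of the text: `x` is the fraction of correct messages, `c_in` and `c_out` are assumed names for the average number of neighbors in a node's own group and in the other group, and a tied vote yields a correct message with probability 1/2. The parameter values are illustrative only.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of k events with mean mu."""
    if mu == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def tiebreak_map(x, c_in, c_out, kmax=200):
    """Probability that an outgoing message is correct, for q = 2 with coin-flip ties."""
    mu_plus = c_in * x + c_out * (1.0 - x)   # votes for the true label
    mu_minus = c_in * (1.0 - x) + c_out * x  # votes against it
    p_plus = [poisson_pmf(k, mu_plus) for k in range(kmax)]
    p_minus = [poisson_pmf(k, mu_minus) for k in range(kmax)]
    tie = sum(u * v for u, v in zip(p_plus, p_minus))
    win = 0.0
    below = 0.0  # running value of P(n_minus < k)
    for k in range(kmax):
        win += p_plus[k] * below
        below += p_minus[k]
    return win + 0.5 * tie

def iterate(x0, c_in, c_out, steps=300):
    x = x0
    for _ in range(steps):
        x = tiebreak_map(x, c_in, c_out)
    return x

# All links within groups (c_out = 0): a low degree still ends at x = 1/2,
# while a higher degree keeps an informative fixed point.
weak = iterate(0.9, c_in=1.0, c_out=0.0)
strong = iterate(0.6, c_in=6.0, c_out=0.0)
print(weak, strong)
```

The first run illustrates the failure mode noted above, where tiebreaking loses the community structure at low degree even when every link lies within a group.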

Neither version of zero-temperature inference performs as well as belief propagation. In particular, the detectability thresholds for both methods are noticeably higher than the algorithm-independent threshold predicted by Decelle2011PRL ; Decelle2011PRE and established rigorously in mossel-neeman-sly ; massoulie ; mossel-neeman-sly-proof . Note, however, that those thresholds do scale correctly for large : both (13) and (18) approach as , while the true threshold Decelle2011PRL ; Decelle2011PRE is .

Figure 1: Detection thresholds for zero-temperature inference with and without tiebreaking, scaled by . The true detectability threshold is at ; the zero-temperature thresholds converge to as .

V Analysis for Arbitrary q

We now consider the case with tiebreaking for arbitrary q. By symmetry, we again have just two types of messages, correct and incorrect, whose densities sum to one. Incorrect messages are spread uniformly over the q - 1 incorrect groups.

Consider the number of messages a node i receives carrying its own group label. These are either correct messages from neighbors in its group, or a fraction 1/(q - 1) of the incorrect messages from other groups. The expected total number of neighbors i has in other groups is q - 1 times the average number in each other group, so this count is Poisson with mean

(19)

For each of the q - 1 other groups, consider the number of messages i receives carrying that group's label. Each such count is Poisson with mean

(20)

The population dynamics then works as follows. Take the largest of these q counts, and let k be the number of incorrect colors that achieve this maximum. Then i emits a correct message with probability 1/(k + 1) if its own label also achieves the maximum, and an incorrect message otherwise.
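One step of this population dynamics can be simulated by sampling the Poisson counts directly. The sketch below is an assumption-laden reading of the text: `x` is the fraction of correct messages, `c_in` and `c_out` are assumed names for the average number of neighbors in a node's own group and in each other group, and the two Poisson means follow from spreading incorrect messages uniformly over the q - 1 wrong labels.

```python
import math
import random

def sample_poisson(mu, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def emits_correct(x, q, c_in, c_out, rng):
    """Sample the incoming counts for one node and apply the tiebreaking rule."""
    mu_own = c_in * x + c_out * (1.0 - x)
    mu_other = c_out * x + (1.0 - x) * (c_in + (q - 2) * c_out) / (q - 1)
    n_own = sample_poisson(mu_own, rng)
    others = [sample_poisson(mu_other, rng) for _ in range(q - 1)]
    n_max = max(others)
    if n_own > n_max:
        return True
    if n_own < n_max:
        return False
    k = others.count(n_max)  # incorrect colors tied with the correct one
    return rng.random() < 1.0 / (k + 1)

def estimate_map(x, q, c_in, c_out, samples=20000, seed=0):
    """Monte Carlo estimate of the probability that a new message is correct."""
    rng = random.Random(seed)
    hits = sum(emits_correct(x, q, c_in, c_out, rng) for _ in range(samples))
    return hits / samples

print(estimate_map(0.5, q=4, c_in=8.0, c_out=2.0))
```

At x = 1/q the two means coincide, so the estimate should fluctuate around 1/q; that gives a quick consistency check on the assumed means.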

The joint probability that k incorrect colors have a count equal to a given value n and that the other q - 1 - k have counts below n is

(21)

where is the Poisson distribution with mean , and is the regularized Gamma function. The fixed point equation is then where

(22)

Figure 2: Fixed point equation for , , and different values of . From bottom to top, , , , and . Observe that at , the second solution emerges discontinuously, indicating a first order transition.

We illustrate the fixed point equation in Fig. 2 for and . A close inspection reveals that there are several different phases separated by two phase transitions, which we denote and . For , there is a single fixed point corresponding to the paramagnetic solution. At a second solution emerges, giving an accurate labeling of the nodes. This occurs when

(23)

Similarly, is defined by

(24)

For q > 2 this transition is first order; that is, the accuracy at the new fixed point is bounded above the paramagnetic value 1/q. In fact, the detectability transition (in the assortative case) is continuous for q ≤ 4 and first-order for q > 4 Decelle2011PRL ; Decelle2011PRE , but the zero-temperature model does give some intuition about why it becomes discontinuous at larger q.

The population dynamics can be described, in a suitable timescale, as . Therefore, for , both and are locally stable, with an unstable fixed point between them. At , the paramagnetic solution becomes unstable.
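The fixed-point map can also be evaluated exactly by summing over Poisson counts, which makes it easy to compare different initial conditions. The sketch below is a reconstruction under stated assumptions, not the paper's own expression: `x` is the fraction of correct messages, `c_in` and `c_out` are assumed names for the average number of neighbors in a node's own group and in each other group, and the tie weight 1/(k + 1) follows the tiebreaking rule described earlier.

```python
import math

def poisson_pmf(n, mu):
    """Poisson probability of n events with mean mu."""
    if mu == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def fixed_point_map(x, q, c_in, c_out):
    """Probability that a new message is correct, given a fraction x of correct inputs."""
    mu_own = c_in * x + c_out * (1.0 - x)
    mu_other = c_out * x + (1.0 - x) * (c_in + (q - 2) * c_out) / (q - 1)
    top = max(mu_own, mu_other)
    n_max = int(top + 12.0 * math.sqrt(top) + 30)  # truncation of the Poisson sums
    total = 0.0
    below = 0.0  # P(count < n) for a single incorrect color
    for n in range(n_max + 1):
        p_eq = poisson_pmf(n, mu_other)
        win = 0.0
        for k in range(q):  # k incorrect colors tie with the correct one at n
            win += math.comb(q - 1, k) * p_eq**k * below**(q - 1 - k) / (k + 1)
        total += poisson_pmf(n, mu_own) * win
        below += p_eq
    return total

def iterate(x0, q, c_in, c_out, steps=200):
    x = x0
    for _ in range(steps):
        x = fixed_point_map(x, q, c_in, c_out)
    return x

q, c_in, c_out = 4, 8.0, 2.0
# By symmetry the paramagnetic point 1/q maps to itself, up to rounding.
print(fixed_point_map(1.0 / q, q, c_in, c_out))
# Compare a nearly random start with an accurate start.
print(iterate(1.0 / q + 1e-3, q, c_in, c_out), iterate(0.99, q, c_in, c_out))
```

Whether the two starts end at the same value depends on the parameters; scanning `c_in` and `c_out` with this function is one way to look for the window in which both fixed points are stable.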

Figure 3: The accuracy as a function of for and . The thresholds and correspond to the appearance of the second solution and the instability of the paramagnetic solution respectively. The dashed line corresponds to initial messages accurate enough to converge to , and the solid line corresponds to random initial messages. Compare Fig. 2(c) in Decelle2011PRL .

These results fit qualitatively with the results from the cavity method in Decelle2011PRL ; Decelle2011PRE , albeit with overestimated values of and . If , the communities are undetectable, since the algorithm converges to the paramagnetic fixed point. If , the communities are easy to detect, since a small perturbation away from the paramagnetic fixed point will lead to . Finally, is the “hard but detectable” regime: we can converge to and label the nodes accurately, but only if the initial messages are accurate enough.

In Fig. 3 we plot the accuracy as a function of . The two curves correspond to different ways to initialize the messages; randomly (solid) and accurately enough to converge to (dashed). The gap between the two transitions corresponds to the regime where the non-paramagnetic solution exists but is hard to find.

Figure 4: Zero-temperature thresholds (blue) and (green) as a function of the number of groups for . The red line shows the true hard/easy threshold .

In Fig. 4 we plot the thresholds and as a function of , while keeping fixed. While belief propagation succeeds whenever is above the true easy/hard transition for any , it appears that increases with . This suggests that, when starting from random messages, zero-temperature inference with tiebreaking performs poorly when the number of communities is large.

VI Semisupervised Inference

So far, we have assumed that the only information available to us is the graph structure. We now focus on “semisupervised” inference, where we are also given some prior information about the true group assignment.

One can distinguish two possible scenarios. In the first, we have noisy information about every node, biasing us toward its correct label. We can represent this by giving the correct label some weight in the tiebreaking rule; then the probability of a correct message in (22) becomes .

In the second scenario, we have information that is perfectly accurate, but limited: namely, we know the true labels of a fraction ρ of the nodes Allahverdyan2010a . Here we measure the accuracy only on the unknown nodes. In that case, we can modify our previous analysis by assuming that a fraction ρ of the incoming messages are from known nodes, and are automatically correct. Thus we replace the densities of correct and incorrect messages in (21) and (22) with

(25)

We found that these two scenarios produce qualitatively similar results, and we focus on the latter one.
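For the second scenario, the substitution described above can be written compactly. The symbol x for the fraction of correct messages among unknown nodes is an assumption of this sketch; ρ is the fraction of nodes with known labels. If a fraction ρ of incoming messages is automatically correct and the rest are correct with probability x, the effective densities of correct and incorrect messages become

```latex
x \;\longrightarrow\; \rho + (1 - \rho)\, x ,
\qquad
1 - x \;\longrightarrow\; (1 - \rho)(1 - x) .
```

The two effective densities still sum to one, and setting ρ = 0 recovers the unsupervised analysis.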

Figure 5: (a) Accuracy vs.  for and , and for varying amounts of prior information, ρ. At ρ = 0, we do no better than chance until , as in Fig. 3. However, even a small ρ moves the boundary between the easy and hard regimes downward, letting us jump to an accurate fixed point. (b) Accuracy vs. ρ, for the same parameters as in (a), and for three different values of . There is a range of where the accuracy jumps discontinuously at a critical value of ρ. At a critical value of , this discontinuity disappears.

Fig. 5 shows the accuracy as a function of for different amounts of prior information, ρ. We see that even a small value of ρ lets us jump to an accurate fixed point analogous to at some , letting us label the nodes even when we are some distance inside the “hard but detectable” regime. Note that this observation is in stark contrast with the behavior reported in Ref. Allahverdyan2010a , where the detection threshold disappeared for any finite positive ρ.

As shown in Fig. 5, there is a range of where the accuracy jumps discontinuously at a critical value of ρ. These discontinuities disappear at a particular value of , corresponding to a tricritical point. Below this the accuracy increases steeply, but continuously, as a function of ρ. This qualitatively reproduces the picture from cavity method calculations for large q zhang-moore-zdeborova .

VII Discussion

We analyzed community detection in the stochastic block model, based on a zero-temperature message-passing algorithm. By breaking ties randomly, we reduced the number of order parameters to one, giving us an analytically tractable model for any number of groups.

The randomized message passing algorithm considered here is not optimal for the community detection problem. Therefore, any detection thresholds reported here can only be viewed as upper bounds on the true (algorithm-independent) detection thresholds. Nevertheless, it lets us analytically reproduce some qualitative aspects of the true transition. For q > 2 it predicts a first-order detectability transition, and a “hard but detectable” regime. We note that the finite-temperature cavity method shows that this regime appears when q > 4 Decelle2011PRL ; Decelle2011PRE .

We also analyzed a “semisupervised” setting where one is given the true labels of a fraction ρ of nodes. In contrast to Allahverdyan2010a , for q > 2 even a small value of ρ significantly moves the boundary between the hard and easy regimes, and there is a range of where the accuracy jumps discontinuously as a function of ρ. This is again in qualitative agreement with the cavity method zhang-moore-zdeborova .

We limited our analysis to the case where the connectivity between nodes depends only on whether they are in the same group or not, and where the groups are of equal size. Our approach can be extended to more general settings, although the analysis will be more complicated.

Acknowledgements.
A.G. and G.V.S. thank the Santa Fe Institute for their hospitality. A.G. and G.V.S. were supported in part by the US AFOSR MURI grant FA9550-10-1-0569, and US DTRA grant HDTRA1-10-1-0086. C.M. is supported by the AFOSR and DARPA under grant #FA9550-12-1-0432. We thank Lenka Zdeborová, Pan Zhang, Elchanan Mossel, and Allan Sly for helpful conversations.

References

  • (1) P. J. Bickel, A. Chen, Proceedings of the National Academy of Sciences (USA) 106, 21068 (2009).
  • (2) J. Reichardt, M. Leone, Phys. Rev. Lett. 101, 078701 (2008).
  • (3) A. Allahverdyan, G. V. Steeg, A. Galstyan, Europhys. Lett. 90, 18002 (2010).
  • (4) A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Phys. Rev. Lett. 107, 065701 (2011).
  • (5) A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Phys. Rev. E 84, 066106 (2011).
  • (6) D. Hu, P. Ronhovde, Z. Nussinov, Philosophical Magazine 92, 406 (2012).
  • (7) P. Ronhovde, D. Hu, Z. Nussinov, Europhys. Lett. 99, 38006 (2012).
  • (8) R. R. Nadakuditi, M. E. J. Newman, Phys. Rev. Lett. 108, 188701 (2012).
  • (9) E. Mossel, J. Neeman, A. Sly (2012). Preprint, arxiv.org/abs/1202.1499v4.
  • (10) L. Massoulié (2013). Preprint, arxiv.org/pdf/1311.3085v1.
  • (11) E. Mossel, J. Neeman, A. Sly (2013). Preprint, arxiv.org/abs/1311.4115v1.
  • (12) M. Mézard, G. Parisi, Europhys. Lett. 3, 1067 (1987).
  • (13) M. Mézard, G. Parisi, The European Physical Journal B - Condensed Matter and Complex Systems 20, 217 (2001).
  • (14) M. Mézard, G. Parisi, J. Stat. Phys. 111, 1 (2003).
  • (15) P. Zhang, C. Moore, L. Zdeborová, Phase transitions in semisupervised learning in networks. In progress.
  • (16) Y. Iba, J. Phys. A: Mathematical and General 32, 3875 (1999).
  • (17) D. Angluin, J. Aspnes, D. Eisenstat, Distributed Computing 21, 87 (2008).