Community or module detection, also known as node clustering, is an important task in the study of biological, social, and technological networks. Many methods have been proposed to solve this problem, including spectral clustering Von Luxburg (2007); Newman (2006); Krzakala et al. (2013); modularity optimization Newman and Girvan (2004); Newman (2004); Clauset et al. (2004); Duch and Arenas (2005); statistical inference using generative models, such as the stochastic block model Decelle et al. (2011a, b); Karrer and Newman (2011); and a wide variety of other methods, e.g. Clauset et al. (2004); Blondel et al. (2008); Rosvall and Bergstrom (2008). See Fortunato (2010) for a review.
It was shown in Decelle et al. (2011b, a) that for sparse networks generated by the stochastic block model Holland et al. (1983), there is a phase transition in community detection. This transition was initially established using the cavity method, or equivalently by analyzing the behavior of belief propagation. It was recently established rigorously in the case of two groups of equal size Mossel et al. (2012); Massoulie (2013); Mossel et al. (2013). In this case, below this transition, no algorithm can label the nodes better than chance, or even distinguish the network from an Erdős–Rényi random graph with high probability. In terms of belief propagation, there is a factorized fixed point where every node is equally likely to be in every group, and it becomes globally stable at the transition.
For more than two groups, there is an additional regime where the factorized fixed point is locally stable, but another, more accurate, fixed point is locally stable as well. This regime lies between two spinodal transitions: the easy/hard transition, where the factorized fixed point becomes locally unstable so that efficient algorithms can achieve high accuracy (also known as the Kesten–Stigum transition or the robust reconstruction threshold), and the transition where the accurate fixed point first appears (also known as the reconstruction threshold). In between these two, there is a first-order phase transition, where the Bethe free energies of these two fixed points cross. This is the detectability transition, in the sense that an algorithm that can search exhaustively for fixed points—which would take exponential time—would choose the accurate fixed point above this transition. However, below this transition there are exponentially many competing fixed points, each corresponding to a cluster of assignments, and even an exponential-time algorithm has no way to tell which is the correct one. (Note that, of these three transitions, the detectability transition is the only true thermodynamic phase transition; the others are dynamical.)
In between the first-order phase transition and the easy/hard transition, there is a “hard but detectable” regime where the communities can be identified in principle: if we could perform an exhaustive search, we would choose the accurate fixed point, since it has lower free energy. In Bayesian terms, the correct block model has larger total likelihood than an Erdős–Rényi graph. However, the accurate fixed point has a very small basin of attraction, making it exponentially hard to find—unless we have some additional information.
Here we model this additional information as a so-called semisupervised learning problem (e.g. Chapelle et al. (2006)) where we are given the true labels of some small fraction α of the nodes. This information shifts the location of these transitions; in essence, it destabilizes the factorized fixed point and pushes us towards the basin of attraction of the accurate one. As a result, for some values of the block model parameters, there is a discontinuous jump in the accuracy as a function of α. Roughly speaking, for very small α our information is local, consisting of the known nodes and good guesses about nodes in their vicinity; but at a certain value of α, belief propagation causes this information to percolate, giving us high accuracy throughout the network. As we vary the block model parameters, this line of discontinuities terminates at the point where the two spinodals and the first-order phase transition all meet. At that critical point there is a second-order phase transition, and beyond that point the accuracy is a continuous function of α.
Semisupervised learning is an important task in machine learning, in settings where hidden variables or labels are expensive and time-consuming to obtain. Semisupervised community detection was studied in several previous papers Allahverdyan et al. (2010); Steeg et al. (to appear); Eaton and Mansbach (2012). The conclusion of Allahverdyan et al. (2010) was that the detectability transition disappears for any positive fraction of known labels. Later, Steeg et al. (to appear) suggested that in some cases it survives for more than two groups. However, both these works were based on an approximate (zero-temperature replica-symmetric) calculation that corresponds to a far-from-optimal algorithm; moreover, this approximation is known to lead to unphysical results in many other models, such as graph coloring (which is a special case of the stochastic block model) and random K-SAT Zdeborová and Krzakala (2007); Mézard and Zecchina (2002).
In the present paper we investigate semisupervised community detection using the cavity method and belief propagation, which in sparse graphs is believed to be Bayes-optimal in the limit N → ∞. From a physics point of view, our results settle the question of what exactly happens in the semisupervised setting, including how the reconstruction, detectability, and easy/hard transitions vary as a function of the fraction α of known labels. From the point of view of mathematics, our calculations provide non-trivial conjectures that we hope will be amenable to rigorous proof.
Our calculations follow the same methodology as those carried out in two other problems:
Study of the ideal glass transition by random pinning.
An important property of Bayes-optimal inference is that the true configuration cannot be distinguished from other configurations sampled at random from the posterior probability measure. This is why taking a disordered system similar to the one in this paper, and fixing the values of a small fraction of the nodes in a randomly chosen equilibrium configuration, is formally the same problem as semisupervised learning. The analysis of systems with pinned particles was carried out in order to better understand the formation of glasses in Cammarota and Biroli (2012, 2013).
Analysis of belief propagation guided decimation. Belief propagation with decimation is a very interesting solver for a wide range of random constraint satisfaction problems. Its performance was analyzed in Montanari et al. (2007); Ricci-Tersenghi and Semerjian (2009). If we decimate a fraction of the variables (i.e., fix them to particular values), this affects the further performance of the algorithm in a way similar to semisupervised learning.
For random K-SAT, a large part of this picture has been made rigorous Coja-Oghlan (2011). Our hope is that similar techniques will apply to our results here. As a first step, very recent work Kanade et al. (2014) shows that semisupervised learning does indeed allow for partial reconstruction below the detectability threshold when the number of groups is sufficiently large.
The paper is organized as follows. Section II includes definitions and a description of the stochastic block model. In Section III we consider semisupervised learning in networks generated by the stochastic block model. In Section IV we consider semisupervised learning in two real-world networks, finding transitions in the accuracy at a critical fraction of known labels, qualitatively similar to our analysis of the block model. We conclude in Section V.
II. The Stochastic Block Model, Belief Propagation, and Semisupervised Learning
The stochastic block model is defined as follows. We have N nodes split into q groups, where group a contains an expected fraction γ_a of the nodes. Edge probabilities are given by a q × q matrix p = {p_ab}. We generate a random network with N nodes as follows. First, we choose a group assignment {t_i} by assigning each node i a label t_i chosen independently with probability γ_{t_i}. Between each pair of nodes i and j, we then add an edge between them with probability p_{t_i t_j}. For now, we assume that the parameters q, γ (a vector denoting the expected group sizes γ_a), and p (a matrix denoting the edge probabilities p_ab) are known.
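To make the generative process concrete, here is a minimal Python sketch of sampling from the block model; the function name `sample_sbm` and the toy parameters are ours, not from the paper:

```python
import random

def sample_sbm(n, gamma, p, seed=0):
    """Sample labels and an edge list from the stochastic block model.

    gamma[a] is the expected fraction of nodes in group a;
    p[a][b] is the probability of an edge between groups a and b.
    """
    rng = random.Random(seed)
    q = len(gamma)
    # Each node gets a label drawn independently with probability gamma_a.
    labels = rng.choices(range(q), weights=gamma, k=n)
    # Each pair (i, j) is joined with probability p[t_i][t_j].
    edges = [(i, j)
             for i in range(n)
             for j in range(i + 1, n)
             if rng.random() < p[labels[i]][labels[j]]]
    return labels, edges

# Toy example: two equal groups with strong assortative structure.
labels, edges = sample_sbm(200, [0.5, 0.5], [[0.10, 0.01], [0.01, 0.10]])
```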
The likelihood of generating G given the parameters and the labels {t_i} is

$$P(G \mid q, \gamma, p, \{t_i\}) = \prod_{(i,j) \in E} p_{t_i t_j} \prod_{(i,j) \notin E} \left(1 - p_{t_i t_j}\right) . \qquad (1)$$

The Gibbs distribution of the labels {t_i}, i.e., their posterior distribution given G, can be computed via Bayes’ rule,

$$P(\{t_i\} \mid G, q, \gamma, p) = \frac{P(G \mid q, \gamma, p, \{t_i\}) \prod_i \gamma_{t_i}}{\sum_{\{s_i\}} P(G \mid q, \gamma, p, \{s_i\}) \prod_i \gamma_{s_i}} . \qquad (2)$$
In this paper, we consider sparse networks where p_ab = c_ab / N for some constant matrix c = {c_ab}. In this case, the marginal probability that a given node i has label t_i can be computed using belief propagation. The idea of BP is to replace these marginals with “messages” ψ^{i→j}_{t_i} from i to each of its neighbors j, which are estimates of these marginals based on i’s interactions with its other neighbors Yedidia et al. (2001); Mézard and Parisi (2001). We assume that the neighbors of each node are conditionally independent of each other; equivalently, we ignore the effect of loops. For the stochastic block model, we obtain the following update equations for these messages Decelle et al. (2011b, a):

$$\psi^{i \to j}_{t_i} = \frac{1}{Z^{i \to j}} \, \gamma_{t_i} e^{-h_{t_i}} \prod_{k \in \partial i \setminus j} \sum_{t_k} c_{t_k t_i} \psi^{k \to i}_{t_k} . \qquad (3)$$

Here Z^{i→j} is a normalization factor and h_{t_i} is an adaptive external field that enforces the expected group sizes,

$$h_{t_i} = \frac{1}{N} \sum_k \sum_{t_k} c_{t_k t_i} \psi^k_{t_k} , \qquad (4)$$
where the marginal probability ψ^i_{t_i} that node i has label t_i is given by

$$\psi^i_{t_i} = \frac{1}{Z^i} \, \gamma_{t_i} e^{-h_{t_i}} \prod_{k \in \partial i} \sum_{t_k} c_{t_k t_i} \psi^{k \to i}_{t_k} . \qquad (5)$$
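As an illustration of these update equations, here is a minimal Python sketch of BP for the sparse block model. For brevity we drop the O(1/N) adaptive external field, which only matters for balancing group sizes on large graphs, and use simple sequential updates; all names here (`bp_marginals`, the toy graph) are ours, not the authors’ implementation:

```python
import numpy as np

def bp_marginals(n, edges, gamma, c, iters=200, tol=1e-6, seed=0):
    """Belief propagation for the sparse block model.

    edges: list of undirected pairs (i, j); gamma: prior group sizes;
    c: q x q matrix of rescaled affinities c_ab = N * p_ab.
    Returns the estimated marginals psi[i, t].
    """
    q = len(gamma)
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma, dtype=float)
    c = np.asarray(c, dtype=float)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # One normalized message per directed edge, initialized at random.
    msg = {}
    for i, j in edges:
        for u, v in ((i, j), (j, i)):
            m = rng.random(q)
            msg[(u, v)] = m / m.sum()
    for _ in range(iters):
        diff = 0.0
        for (u, v) in list(msg):
            # psi^{u->v}_t is proportional to gamma_t times the product,
            # over u's neighbors k other than v, of sum_s c_{s t} psi^{k->u}_s.
            new = gamma.copy()
            for k in neighbors[u]:
                if k != v:
                    new *= c.T @ msg[(k, u)]
            new /= new.sum()
            diff = max(diff, np.abs(new - msg[(u, v)]).max())
            msg[(u, v)] = new
        if diff < tol:
            break
    # Marginals use the same product, now over all neighbors of i.
    psi = np.zeros((n, q))
    for i in range(n):
        b = gamma.copy()
        for k in neighbors[i]:
            b *= c.T @ msg[(k, i)]
        psi[i] = b / b.sum()
    return psi

# Tiny demo: two triangles joined by a single edge, assortative affinities.
demo_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
psi = bp_marginals(6, demo_edges, [0.5, 0.5], [[8.0, 1.0], [1.0, 8.0]])
```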
In the usual setting, we start with random messages, and apply the BP equations (3) until we reach a fixed point. In order to predict the node labels, we assign each node to its most-likely label according to its marginal:

$$\hat{t}_i = \operatorname*{argmax}_{t} \psi^i_t . \qquad (6)$$
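The prediction step can be sketched in the same spirit. Since the groups are only identifiable up to a relabeling, the agreement with the truth is maximized over label permutations; normalization conventions for the overlap vary in the literature, and the helper names below are ours:

```python
from itertools import permutations
import numpy as np

def predict(psi):
    """Assign each node its most likely label, the argmax over t of psi[i, t]."""
    return np.argmax(psi, axis=1)

def overlap(pred, truth, q):
    """Fraction of correctly labeled nodes, maximized over the q! relabelings."""
    n = len(truth)
    return max(sum(perm[p] == t for p, t in zip(pred, truth)) / n
               for perm in permutations(range(q)))

# A partition that is perfect up to swapping the two labels has overlap 1.
pred = predict(np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]))
```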
Fixed points of the BP equations are stationary points of the Bethe free energy Yedidia et al. (2001), which up to a constant is

$$f_{\mathrm{Bethe}} = \frac{1}{N} \left( \sum_{(i,j) \in E} \log Z^{ij} - \sum_i \log Z^i \right) , \qquad Z^{ij} = \sum_{a,b} c_{ab} \, \psi^{i \to j}_a \psi^{j \to i}_b . \qquad (7)$$

If there is more than one fixed point, the one with the lowest f_Bethe gives an optimal estimate of the marginals, since N f_Bethe is, up to a constant, minus the logarithm of the total likelihood of the block model. However, as we commented above, if the optimal fixed point has a small basin of attraction, then finding it through exhaustive search will take exponential time. Analyzing the stability or instability of these fixed points, including the trivial or “factorized” fixed point where ψ^{i→j}_t = γ_t, leads to the phase transitions in the stochastic block model described in Decelle et al. (2011b, a).
It is straightforward to adapt the above formalism to the case of semisupervised learning. One uses exactly the same equations, except that nodes whose labels have been revealed have fixed messages. If we know that t_i = a, then for all neighbors j of i we have

$$\psi^{i \to j}_t = \delta_{t,a} . \qquad (8)$$

Equivalently, we can define a local external field, replacing the global prior γ_t with a node-dependent prior γ^i_t in (3), where γ^i_t = δ_{t,a} if i’s label is known to be a, and γ^i_t = γ_t otherwise.
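In code, clamping the revealed nodes amounts to overwriting their outgoing messages with point masses; a sketch, where the dict-of-directed-edges layout is just one convenient choice of data structure:

```python
import numpy as np

def clamp_known_labels(msg, neighbors, known, q):
    """Fix the outgoing messages of revealed nodes to point masses:
    psi^{i->j}_t = delta_{t,a} whenever node i is known to have label a."""
    for i, a in known.items():
        point = np.zeros(q)
        point[a] = 1.0
        for j in neighbors[i]:
            msg[(i, j)] = point.copy()
    return msg

# Demo on a path 0 - 1 - 2 where node 0 is revealed to be in group 1.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
msg = {e: np.full(2, 0.5) for e in [(0, 1), (1, 0), (1, 2), (2, 1)]}
clamp_known_labels(msg, neighbors, {0: 1}, q=2)
```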
In this paper we focus on a widely studied special case of the stochastic block model, also known as the planted partition model, where the groups are of equal size, i.e., γ_a = 1/q for all a, and where c_ab takes only two values:

$$c_{ab} = \begin{cases} c_{\mathrm{in}} & a = b , \\ c_{\mathrm{out}} & a \neq b . \end{cases} \qquad (9)$$

In that case, the average degree is c = [c_in + (q − 1) c_out] / q. It is common to parametrize this model with the ratio ε = c_out / c_in. When ε = 0, nodes are connected only to others in the same group; at ε = 1 the network is an Erdős–Rényi graph, where every pair of nodes is equally likely to be connected.
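These relations are easy to encode and sanity-check; a minimal sketch, assuming the standard parametrization with c_in on the diagonal of the affinity matrix and c_out off it (the function names are ours):

```python
def avg_degree(c_in, c_out, q):
    """Average degree of the planted partition model: c = [c_in + (q-1) c_out] / q."""
    return (c_in + (q - 1) * c_out) / q

def epsilon(c_in, c_out):
    """Ratio eps = c_out / c_in; eps = 1 is the Erdos-Renyi point."""
    return c_out / c_in
```

For instance, with q = 5 groups, c_in = 5, and c_out = 1, the average degree is 1.8 and ε = 0.2; at ε = 1 the average degree reduces to c_in, as it should for an Erdős–Rényi graph.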
We assume here that the parameters are known. If they are unknown, inferring them from the graph and partial information about the nodes is an interesting learning problem in its own right. One can estimate them from the set of known edges, i.e., those where both endpoints have known labels: if a fraction α of the labels are known, in the sparse case there are O(α²N) such edges, or Θ(N) if α is constant. However, when α is small there are very few such edges, making this estimate noisy. Alternately, we can learn the parameters using the expectation-maximization (EM) algorithm of Decelle et al. (2011b, a), which minimizes the Bethe free energy, or a hybrid method where we initialize EM with parameters estimated from the known edges, if any.
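A hedged sketch of this naive estimator, counting pairs of known nodes that share a label and the edges between them (all names are ours, and we assume the sparse scaling c_ab = N p_ab):

```python
from itertools import combinations

def estimate_affinities(n, edges, known):
    """Estimate (c_in, c_out) from edges whose endpoints both have known labels.

    known: dict node -> revealed label. With c_ab = N * p_ab, each same-group
    pair of nodes carries an edge with probability c_in / n, each cross-group
    pair with probability c_out / n.
    """
    nodes = list(known)
    same_pairs = sum(1 for i, j in combinations(nodes, 2) if known[i] == known[j])
    diff_pairs = len(nodes) * (len(nodes) - 1) // 2 - same_pairs
    in_edges = sum(1 for i, j in edges
                   if i in known and j in known and known[i] == known[j])
    out_edges = sum(1 for i, j in edges
                    if i in known and j in known and known[i] != known[j])
    c_in = n * in_edges / same_pairs if same_pairs else 0.0
    c_out = n * out_edges / diff_pairs if diff_pairs else 0.0
    return c_in, c_out

# Demo: 4 known nodes out of n = 100, two in-group edges, one out-group edge.
c_in_hat, c_out_hat = estimate_affinities(
    100, [(0, 1), (2, 3), (0, 2)], {0: 0, 1: 0, 2: 1, 3: 1})
```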
III. Results on the Stochastic Block Model and the Fate of the Transitions
First we investigate semisupervised learning for assortative networks, i.e., the case c_in > c_out. As shown in Decelle et al. (2011a), in the unsupervised case there is a phase transition at

$$\epsilon_c = \frac{\sqrt{c} - 1}{\sqrt{c} + q - 1} , \qquad (10)$$

where the factorized fixed point goes from stable to unstable. Below this transition the overlap, i.e., the fraction of correctly assigned nodes, is 1/q, no better than random chance. For small q this phase transition is second-order: the overlap is continuous, but with a discontinuous derivative at the transition. For larger q, it becomes an “easy/hard” transition, with a discontinuity in the overlap when we jump from the factorized fixed point to the accurate one. In both cases, the convergence time (the number of iterations BP takes to reach a fixed point) diverges at the transition.
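This threshold can be checked numerically against the Kesten–Stigum condition |c_in − c_out| = q √c of Decelle et al. (2011a), from which it follows by rewriting in terms of ε = c_out/c_in and the average degree c (the function name is ours):

```python
import math

def eps_critical(c, q):
    """Kesten-Stigum point of the planted partition model,
    eps_c = (sqrt(c) - 1) / (sqrt(c) + q - 1),
    obtained by rewriting |c_in - c_out| = q * sqrt(c) in terms of
    eps = c_out / c_in and c = [c_in + (q - 1) * c_out] / q."""
    return (math.sqrt(c) - 1) / (math.sqrt(c) + q - 1)

# Consistency check: reconstruct c_in and c_out at the threshold and
# verify the Kesten-Stigum condition directly.
c, q = 16.0, 2
eps = eps_critical(c, q)
c_in = q * c / (1 + (q - 1) * eps)
c_out = eps * c_in
```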
In Fig. 1 we show the overlap achieved by BP for two different numbers of groups q and various values of the fraction α of known labels. In each case, we hold the average degree c fixed and vary ε = c_out/c_in. On the left, q is small. Here, analogous to the unsupervised case α = 0, the overlap is a continuous function of ε. Moreover, for any α > 0 the detectability transition disappears: the overlap becomes a smooth function, and the convergence time no longer diverges. This picture agrees qualitatively with the approximate analytical results in Allahverdyan et al. (2010); Steeg et al. (to appear).
On the right-hand side of Fig. 1, we show experimental results with a larger q. Here the easy/hard transition persists for sufficiently small α, with a discontinuity in the overlap and a diverging convergence time. At a critical value of α, the transition disappears, and the overlap becomes a smooth function of ε; beyond that point the convergence time has a smooth peak but does not diverge. Thus there is a line of discontinuities, ending in a second-order phase transition at a critical point. We show this line in the (ε, α) plane in Fig. 2. On the left, we see the discontinuity in the overlap, and on the right we see that the convergence time diverges along this line.
Note that the authors of Steeg et al. (to appear) also predicted the survival of the easy/hard discontinuity in the assortative case. Their approximate computation, however, overestimates the strength of the phase transition and misplaces its position. In particular, it predicts the discontinuity for all q, whereas it holds only for sufficiently large q.
The full physical picture of what happens to the “hard but detectable” regime in the semisupervised case, and to the spinodal and detectability transitions that define it, is very interesting. To explain it in detail we focus on the disassortative case, and specifically on planted graph coloring, where c_in = 0. The situation for the assortative case is qualitatively similar, but for graph coloring the discontinuity in the overlap is very strong and appears already for a moderate number of colors q, making these phenomena easier to see numerically.
Fig. 3, on the left, shows the overlap and convergence time of BP for planted coloring. In the unsupervised case α = 0, there are a total of three transitions as we decrease the average degree c (making the problem of recovering the planted coloring harder). The overlap jumps at the easy/hard spinodal transition, where the factorized fixed point becomes stable; for planted q-coloring this occurs at c_ℓ = (q − 1)². At the lower spinodal transition, the accurate fixed point disappears. In between these two spinodal transitions, both fixed points exist. Their Bethe free energies cross at the detectability transition: below this point, even a Bayesian algorithm with the luxury of exhaustive search would do no better than chance. Thus the “hard but detectable” regime lies in between the detectability and easy/hard transitions Zdeborová and Krzakala (2007).
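The easy/hard degree for planted coloring has a simple closed form, following from the same Kesten–Stigum condition |c_in − c_out| = q √c with c_in = 0 and c_out = qc/(q − 1); a quick check (the function name is ours):

```python
import math

def coloring_ks_degree(q):
    """Easy/hard (Kesten-Stigum) average degree for planted q-coloring.
    Setting c_in = 0 and c_out = q * c / (q - 1) in the condition
    |c_in - c_out| = q * sqrt(c) yields c = (q - 1) ** 2."""
    return (q - 1) ** 2

# Verify the condition directly for q = 5 colors.
c = coloring_ks_degree(5)
c_out = 5 * c / (5 - 1)
```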
On the right of Fig. 3, we plot the two spinodal transitions, and the detectability transition in between them, in the (c, α) plane. We see that these transitions persist up to a critical value of α. At that point, all three meet at a second-order phase transition, beyond which the overlap is a smooth function. The very same picture arises in the two related problems mentioned in the introduction, namely the glass transition with random pinning and BP-guided decimation in random K-SAT; see e.g. Fig. 1 in Cammarota and Biroli (2012, 2013) and Fig. 3 in Ricci-Tersenghi and Semerjian (2009).
Finally, in Fig. 4 we plot the overlap and convergence time for the planted coloring problem in the (c, α) plane. Analogous to the assortative case in Fig. 2, but more visibly, there is a line of discontinuities in the overlap along which the convergence time diverges; the height of the discontinuity decreases until we reach the critical point.
IV. Results on Real-World Networks
In this section we study semisupervised learning in real-world networks. Real-world networks are of course not generated by the stochastic block model; however, the block model can often achieve high accuracy for community detection, if its parameters are fit to the network.
To explore the effect of semisupervised learning, we set the block model parameters in two different ways. In the first, the algorithm is given the best possible values of these parameters in advance, as determined by the ground-truth labels: this is cheating, but it separates the effect of being given node labels from the process of learning the parameters. In the second (more realistic) way, the algorithm uses the expectation-maximization (EM) algorithm of Decelle et al. (2011b, a), which minimizes the Bethe free energy. As discussed in Section II, we initialize the EM algorithm with parameters estimated from edges where both endpoints have known labels, if any.
We test two real networks, namely a network of political blogs Adamic and Glance (2005) and Zachary’s karate club network. The blog network is composed of blogs, and the links between them, that were active during the 2004 US elections; human curators labeled each blog as liberal or conservative. In Fig. 5 we plot the overlap between the inferred labels and the ground-truth labels, with multiple independent runs of BP with different initial messages. This network is known not to be well modeled by the stochastic block model, since the highest-likelihood SBM splits the nodes into a core-periphery structure, with high-degree nodes in the core and low-degree nodes outside it, instead of dividing the network along political lines Karrer and Newman (2011). Indeed, as the top left panel shows, even given the correct parameters, BP often falls into a core-periphery structure instead of the correct one. However, once the fraction α of known labels is sufficiently large, we move into the basin of attraction of the correct division.
On the top right of Fig. 5, we see a similar transition, but now because the EM algorithm succeeds in learning the correct parameters. There are two fixed points of the learning process in parameter space, corresponding to the high/low-degree division and the political one. Both of these are local minima of the free energy Zhang et al. (2012). As α increases, the correct one becomes the global minimum, and its basin of attraction grows, until the fraction of runs that arrive at the correct parameters (and therefore an accurate partition) becomes large.
We show this learning process in the lower panels of Fig. 5. Since there are just two groups, γ_1 determines the group sizes, where we break symmetry by taking group 1 to be the smaller group. As the fraction α of known labels increases, γ_1 moves toward the correct group sizes. On the lower right, we see the parameters change from a core-periphery structure to an assortative one with c_in > c_out.
Our second example is Zachary’s karate club Zachary (1977), which represents friendship patterns between the 34 members of a university karate club that split into two factions. As with the blog network, it has two local optima in parameter space, one corresponding to a high/low-degree division (which in the unsupervised case has lower free energy) and the other to the two factions Decelle et al. (2011a). We again perform two types of experiments, one where the best parameters are known in advance, and the other where we learn these parameters with the EM algorithm. Our results are shown in Fig. 6 and are similar to Fig. 5. As the fraction α of known labels increases, the overlap improves, in the first case because the known labels push us into the basin of attraction of the correct division, and in the second case because the EM algorithm finds the correct parameters.
As discussed elsewhere Karrer and Newman (2011); Decelle et al. (2011a), the standard stochastic block model performs poorly on these networks in the unsupervised case. It assumes a Poisson degree distribution within each community, and thus tends to divide nodes into groups according to their degree; in contrast, these networks (and many others) have heavy-tailed degree distributions within communities. A better model for these networks is the degree-corrected stochastic block model Karrer and Newman (2011), which achieves a large overlap on the blog network even when no labels are known. We emphasize that our analysis can easily be carried out for the degree-corrected SBM as well, using the BP equations given in Yan et al. (to appear). On the other hand, it is interesting to observe how, in the semisupervised case, even the standard SBM succeeds in recognizing the network’s structure at a moderate value of α.
V. Conclusion and Discussion
We have studied semisupervised learning in sparse networks with belief propagation and the stochastic block model, focusing on how the detectability and easy/hard transitions depend on the fraction α of known nodes. In agreement with previous work based on a zero-temperature approximation Allahverdyan et al. (2010); Steeg et al. (to appear), for a small number of groups the detectability transition disappears for any α > 0. However, for larger q, where there is a hard but detectable phase in the unsupervised case, the easy/hard transition persists up to a critical value of α, creating a line of discontinuities in the overlap ending in a second-order phase transition.
We found qualitatively similar behavior in two real networks, where the overlap jumps discontinuously at a critical value of α. When the best possible parameters of the block model are known in advance, this happens when the basin of attraction of the correct structure becomes large enough; when we learn them with an EM algorithm as in Decelle et al. (2011b, a), it occurs because the optimal parameters become the global minimum of the free energy. In particular, even though the standard block model is not a good fit to networks like the blog network or the karate club, where each community has a heavy-tailed degree distribution, we found that at a certain value of α it switches from a core-periphery structure to the correct assortative structure.
It would be very interesting to apply this formalism to active learning, where rather than learning the labels of a random set of nodes, the algorithm must choose which nodes to explore. One approach to this problem Moore et al. (2011) is to explore the node with the largest mutual information between it and the rest of the network, as estimated by Monte Carlo sampling of the Gibbs distribution, or (more efficiently) using belief propagation. We leave this for future work.
Acknowledgements. C.M. and P.Z. were supported by AFOSR and DARPA under grant FA9550-12-1-0432. We are grateful to Florent Krzakala, Elchanan Mossel, and Allan Sly for helpful conversations.
- Von Luxburg (2007) U. Von Luxburg, Statistics and Computing 17, 395 (2007).
- Newman (2006) M. E. J. Newman, Physical Review E 74, 036104 (2006).
- Krzakala et al. (2013) F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, Proceedings of the National Academy of Sciences 110, 20935 (2013).
- Newman and Girvan (2004) M. E. J. Newman and M. Girvan, Physical Review E 69, 026113 (2004).
- Newman (2004) M. E. Newman, Physical Review E 69, 066133 (2004).
- Clauset et al. (2004) A. Clauset, M. E. Newman, and C. Moore, Physical Review E 70, 066111 (2004).
- Duch and Arenas (2005) J. Duch and A. Arenas, Physical Review E 72, 027104 (2005).
- Decelle et al. (2011a) A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review E 84, 066106 (2011a).
- Decelle et al. (2011b) A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review Lett. 107, 065701 (2011b).
- Karrer and Newman (2011) B. Karrer and M. E. J. Newman, Physical Review E 83, 016107 (2011).
- Blondel et al. (2008) V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, Journal of Statistical Mechanics: Theory and Experiment 2008, P10008 (2008).
- Rosvall and Bergstrom (2008) M. Rosvall and C. T. Bergstrom, Proceedings of the National Academy of Sciences 105, 1118 (2008).
- Fortunato (2010) S. Fortunato, Physics Reports 486, 75 (2010).
- Holland et al. (1983) P. W. Holland, K. B. Laskey, and S. Leinhardt, Social Networks 5, 109 (1983).
- Mossel et al. (2012) E. Mossel, J. Neeman, and A. Sly, preprint arXiv:1202.1499 (2012).
- Massoulie (2013) L. Massoulie, preprint arXiv:1311.3085 (2013).
- Mossel et al. (2013) E. Mossel, J. Neeman, and A. Sly, preprint arXiv:1311.4115 (2013).
- Chapelle et al. (2006) O. Chapelle, B. Schölkopf, A. Zien, et al., Semi-supervised learning, vol. 2 (MIT press Cambridge, 2006).
- Allahverdyan et al. (2010) A. E. Allahverdyan, G. Ver Steeg, and A. Galstyan, Europhysics Letters 90, 18002 (2010).
- Steeg et al. (to appear) G. V. Steeg, C. Moore, A. Galstyan, and A. E. Allahverdyan, Europhysics Letters (to appear), arxiv.org/abs/1312.0631.
- Eaton and Mansbach (2012) E. Eaton and R. Mansbach, in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (2012).
- Zdeborová and Krzakala (2007) L. Zdeborová and F. Krzakala, Physical Review E 76, 031131 (2007).
- Mézard and Zecchina (2002) M. Mézard and R. Zecchina, Physical Review E 66, 056126 (2002).
- Cammarota and Biroli (2012) C. Cammarota and G. Biroli, Proceedings of the National Academy of Sciences 109, 8850 (2012).
- Cammarota and Biroli (2013) C. Cammarota and G. Biroli, The Journal of Chemical Physics 138, 12A547 (2013).
- Montanari et al. (2007) A. Montanari, F. Ricci-Tersenghi, and G. Semerjian, in Proceedings of the 45th Allerton Conference (2007), pp. 352–359.
- Ricci-Tersenghi and Semerjian (2009) F. Ricci-Tersenghi and G. Semerjian, Journal of Statistical Mechanics: Theory and Experiment 2009, P09001 (2009).
- Coja-Oghlan (2011) A. Coja-Oghlan, in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SIAM, 2011), pp. 957–966.
- Kanade et al. (2014) V. Kanade, E. Mossel, and T. Schramm, preprint arXiv:1404.6325 (2014).
- Yedidia et al. (2001) J. Yedidia, W. Freeman, and Y. Weiss, in International Joint Conference on Artificial Intelligence (IJCAI) (2001).
- Mézard and Parisi (2001) M. Mézard and G. Parisi, Eur. Phys. J. B 20, 217 (2001).
- Adamic and Glance (2005) L. A. Adamic and N. Glance, in Proceedings of the 3rd International Workshop on Link Discovery (ACM, 2005), pp. 36–43.
- Zhang et al. (2012) P. Zhang, F. Krzakala, J. Reichardt, and L. Zdeborová, Journal of Statistical Mechanics: Theory and Experiment 2012, P12021 (2012).
- Zachary (1977) W. W. Zachary, Journal of Anthropological Research pp. 452–473 (1977).
- Yan et al. (to appear) X. Yan, J. E. Jensen, F. Krzakala, C. Moore, C. R. Shalizi, L. Zdeborová, P. Zhang, and Y. Zhu, Journal of Statistical Mechanics: Theory and Experiment (to appear), arxiv.org/abs/1207.3994.
- Moore et al. (2011) C. Moore, X. Yan, Y. Zhu, J.-B. Rouquier, and T. Lane, in Proceedings of KDD (2011), pp. 841–849.