Hierarchical structure and the prediction of missing links in networks

11/04/2008 · by Aaron Clauset, et al.

Networks have in recent years emerged as an invaluable tool for describing and quantifying complex systems in many branches of science. Recent studies suggest that networks often exhibit hierarchical organization, where vertices divide into groups that further subdivide into groups of groups, and so forth over multiple scales. In many cases these groups are found to correspond to known functional units, such as ecological niches in food webs, modules in biochemical networks (protein interaction networks, metabolic networks, or genetic regulatory networks), or communities in social networks. Here we present a general technique for inferring hierarchical structure from network data and demonstrate that the existence of hierarchy can simultaneously explain and quantitatively reproduce many commonly observed topological properties of networks, such as right-skewed degree distributions, high clustering coefficients, and short path lengths. We further show that knowledge of hierarchical structure can be used to predict missing connections in partially known networks with high accuracy, and for more general network structures than competing techniques. Taken together, our results suggest that hierarchy is a central organizing principle of complex networks, capable of offering insight into many network phenomena.



Supplementary Information

Appendix A Hierarchical random graphs

Our model for the hierarchical organization of a network is as follows. (Computer code implementing many of the analysis methods described in this paper can be found online at www.santafe.edu/aaronc/randomgraphs/.) Let $G$ be a graph with $n$ vertices. A dendrogram $D$ is a binary tree with $n$ leaves corresponding to the vertices of $G$. Each of the $n-1$ internal nodes of $D$ corresponds to the group of vertices that are descended from it. We associate a probability $p_r$ with each internal node $r$. Then, given two vertices $i,j$ of $G$, the probability $p_{ij}$ that they are connected by an edge is $p_{ij} = p_r$, where $r$ is their lowest common ancestor in $D$. The combination $(D,\{p_r\})$ of the dendrogram and the set of probabilities then defines a hierarchical random graph.
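To make the model concrete, here is a minimal Python sketch of a hierarchical random graph, assuming a representation in which leaves are vertex labels and each internal node carries its probability $p_r$; the class and function names are our own illustrations, not taken from the code linked above.

    import random

    class Node:
        """Internal dendrogram node; leaves are plain vertex labels."""
        def __init__(self, left, right, p):
            self.left = left      # left subtree (Node) or vertex label
            self.right = right    # right subtree (Node) or vertex label
            self.p = p            # connection probability p_r at this node

    def leaves(t):
        """Set of graph vertices in subtree t."""
        if isinstance(t, Node):
            return leaves(t.left) | leaves(t.right)
        return {t}

    def lowest_common_ancestor(root, i, j):
        """Descend from the root until i and j fall into different subtrees."""
        node = root
        while True:
            i_left = i in leaves(node.left)
            j_left = j in leaves(node.left)
            if i_left != j_left:
                return node                  # i and j separate at this node
            node = node.left if i_left else node.right

    def sample_graph(root, vertices):
        """Draw one graph: each pair (i, j) is connected independently with
        probability p_r, where r is the pair's lowest common ancestor."""
        vs = sorted(vertices)
        edges = set()
        for a in range(len(vs)):
            for b in range(a + 1, len(vs)):
                r = lowest_common_ancestor(root, vs[a], vs[b])
                if random.random() < r.p:
                    edges.add((vs[a], vs[b]))
        return edges

    # Two tightly linked pairs, joined only loosely at the top of the tree.
    D = Node(Node(0, 1, 0.9), Node(2, 3, 0.9), 0.1)
    print(sample_graph(D, {0, 1, 2, 3}))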

Note that if a community has, say, three subcommunities, with an equal probability of connections between them, we can represent this in our model by first splitting one of these subcommunities off, and then splitting the other two. The two internal nodes corresponding to these splits would be given the same probabilities $p_r$. This yields three possible binary dendrograms, which are all considered equally likely.

We can think of the hierarchical random graph as a variation on the classical Erdős–Rényi random graph $G(n,p)$. As in that model, the presence or absence of an edge between any pair of vertices is independent of the presence or absence of any other edge. However, whereas in $G(n,p)$ every pair of vertices has the same probability $p$ of being connected, in the hierarchical random graph the probabilities are inhomogeneous, with the inhomogeneities controlled by the topological structure of the dendrogram $D$ and the parameters $\{p_r\}$. Many other models with inhomogeneous edge probabilities have, of course, been studied in the past. One example is a structured random graph in which there are a finite number of types of vertices with a matrix giving the connection probabilities between them [F. McSherry, "Spectral Partitioning of Random Graphs," Proc. Foundations of Computer Science (FOCS), pp. 529-537 (2001)].

Appendix B Fitting the hierarchical random graph to data

Now we turn to the question of finding the hierarchical random graph or graphs that best fit the observed real-world network $G$. Assuming that all hierarchical random graphs are a priori equally likely, the probability that a given model $(D,\{p_r\})$ is the correct explanation of the data is, by Bayes' theorem, proportional to the posterior probability or likelihood $L(D,\{p_r\})$ with which that model generates the observed network [G. Casella and R. L. Berger, Statistical Inference, Duxbury Press, Belmont (2001)]. Our goal is to maximize $L(D,\{p_r\})$ or, more generally, to sample the space of all models with probability proportional to $L(D,\{p_r\})$.

Let $E_r$ be the number of edges in $G$ whose endpoints have $r$ as their lowest common ancestor in $D$, and let $L_r$ and $R_r$, respectively, be the numbers of leaves in the left and right subtrees rooted at $r$. Then the likelihood of the hierarchical random graph is

$$L(D,\{p_r\}) = \prod_{r \in D} p_r^{E_r} \, (1-p_r)^{L_r R_r - E_r}, \qquad (1)$$

with the convention that $0^0 = 1$.

If we fix the dendrogram $D$, it is easy to find the probabilities $\{p_r\}$ that maximize $L(D,\{p_r\})$. For each $r$, they are given by

$$\bar{p}_r = \frac{E_r}{L_r R_r}, \qquad (2)$$

the fraction of potential edges between the two subtrees of $r$ that actually appear in the graph $G$. The likelihood of the dendrogram evaluated at this maximum is then

$$L(D) = \prod_{r \in D} \left[ \bar{p}_r^{\,\bar{p}_r} (1-\bar{p}_r)^{1-\bar{p}_r} \right]^{L_r R_r}. \qquad (3)$$
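To see why, substitute Eq. (2) into Eq. (1): since $E_r = \bar{p}_r L_r R_r$, each factor can be rewritten as

$$\bar{p}_r^{\,E_r} \, (1-\bar{p}_r)^{L_r R_r - E_r} = \left[ \bar{p}_r^{\,\bar{p}_r} (1-\bar{p}_r)^{1-\bar{p}_r} \right]^{L_r R_r}.$$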

Figure S1 shows an illustrative example, consisting of a network with six vertices.

Figure S1: An example network $G$ consisting of six vertices, and the likelihood of two possible dendrograms. The internal nodes $r$ of each dendrogram are labeled with the maximum-likelihood probability $\bar{p}_r$, i.e., the fraction of potential edges between their left and right subtrees that exist in $G$. According to Eq. (3), the second dendrogram has a far higher likelihood because it correctly divides the network into two highly connected subgraphs at the first level.
Figure S2: Each internal node $r$ of the dendrogram has three associated subtrees (the two descended from its daughters and the one descended from its sibling), which can be placed in any of three configurations. (Note that the topology of the dendrogram depends only on the sibling and parent relationships; the order, left to right, in which subtrees are depicted is irrelevant.)

It is often convenient to work with the logarithm of the likelihood,

$$\log L(D) = -\sum_{r \in D} L_r R_r \, h(\bar{p}_r), \qquad (4)$$

where $h(p) = -p \log p - (1-p) \log (1-p)$ is the Gibbs-Shannon entropy function. Note that each term $-L_r R_r \, h(\bar{p}_r)$ is maximized when $\bar{p}_r$ is close to 0 or to 1, i.e., when the entropy is minimized. In other words, high-likelihood dendrograms are those that partition the vertices into groups between which connections are either very common or very rare.
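As a concrete illustration, the following Python sketch evaluates Eqs. (2)-(4) for a dendrogram written as nested pairs, e.g. ((0, 1), (2, 3)); the representation and the example network are our own choices.

    from math import log

    def log_likelihood(dendro, edges):
        """Compute log L(D) of Eq. (4); leaves are vertex labels and
        internal nodes are 2-tuples of subtrees."""
        def h(p):                              # Gibbs-Shannon entropy
            return 0.0 if p in (0.0, 1.0) else -p*log(p) - (1-p)*log(1-p)

        def walk(node):
            """Return the subtree's leaf set and its log-likelihood."""
            if not isinstance(node, tuple):    # leaf
                return {node}, 0.0
            lset, ll = walk(node[0])
            rset, lr = walk(node[1])
            # E_r: edges with one endpoint in each of the two subtrees
            E = sum((i in lset and j in rset) or (j in lset and i in rset)
                    for i, j in edges)
            Lr, Rr = len(lset), len(rset)      # L_r and R_r
            pbar = E / (Lr * Rr)               # maximum-likelihood value, Eq. (2)
            return lset | rset, ll + lr - Lr * Rr * h(pbar)

        return walk(dendro)[1]

    # A six-vertex test network: two triangles joined by a single edge.
    edges = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)}
    good = ((0, (1, 2)), (3, (4, 5)))          # separates the two triangles
    print(log_likelihood(good, edges))         # log[(1/9)(8/9)^8] = -3.14...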

We now use a Markov chain Monte Carlo method to sample dendrograms $D$ with probability proportional to their likelihood $L(D)$. To create the Markov chain we need to pick a set of transitions between possible dendrograms. The transitions we use consist of rearrangements of subtrees of the dendrogram, as follows. First, note that each internal node $r$ of a dendrogram $D$ is associated with three subtrees: the two subtrees descended from its daughters, and the subtree descended from its sibling. As Figure S2 shows, there are two ways we can reorder these subtrees without disturbing any of their internal relationships. Each step of our Markov chain consists first of choosing an internal node $r$ uniformly at random (other than the root) and then choosing uniformly at random between the two alternate configurations of the subtrees associated with that node and adopting that configuration. The result is a new dendrogram $D'$. It is straightforward to show that transitions of this type are ergodic, i.e., that any pair of finite dendrograms can be connected by a finite series of such transitions.

Figure S3: Application of our hierarchical decomposition to the network of grassland species interactions. a, Original (blue) and resampled (red) degree distributions. b, Original and resampled distributions of vertex-vertex distances.

Once we have generated our new dendrogram $D'$, we accept or reject that dendrogram according to the standard Metropolis-Hastings rule [M. E. J. Newman and G. T. Barkema, Monte Carlo Methods in Statistical Physics, Clarendon Press, Oxford (1999)]. Specifically, we accept the transition $D \to D'$ if $\Delta \log L = \log L(D') - \log L(D)$ is nonnegative, so that $D'$ is at least as likely as $D$; otherwise we accept the transition with probability $e^{\Delta \log L} = L(D')/L(D)$. If the transition is not accepted, the dendrogram remains the same on this step of the chain. The Metropolis-Hastings rule ensures detailed balance and, in combination with the ergodicity of the transitions, guarantees a limiting probability distribution over dendrograms that is proportional to the likelihood, $L(D)$. The quantity $\Delta \log L$ can be calculated easily, since the only terms in Eq. (4) that change from $D$ to $D'$ are those involving the subtrees associated with the chosen node $r$.
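A compact Python sketch of this Markov chain, reusing the nested-pair dendrograms and the log_likelihood function from the sketch above (our own minimal implementation; it assumes at least three leaves, so that a non-root internal node exists, and for simplicity recomputes the likelihood in full rather than only the changed terms):

    import random
    from math import exp

    def internal_paths(node, path=''):
        """Paths ('0' = left, '1' = right) from the root to each internal node."""
        if not isinstance(node, tuple):
            return []
        return ([path] + internal_paths(node[0], path + '0')
                       + internal_paths(node[1], path + '1'))

    def propose(dendro):
        """Propose D': at a random non-root internal node r, exchange one of
        r's two subtrees with r's sibling subtree (the move of Figure S2)."""
        target = random.choice(internal_paths(dendro)[1:])   # exclude root ''

        def rebuild(node, path):
            if len(path) == 1:                 # node is the parent of r
                k = int(path)
                r, sibling = node[k], node[1 - k]
                a, b = r
                # choose uniformly between the two alternate configurations
                new_r, new_sib = random.choice([((a, sibling), b),
                                                ((b, sibling), a)])
                return (new_r, new_sib) if k == 0 else (new_sib, new_r)
            k = int(path[0])
            rebuilt = rebuild(node[k], path[1:])
            return (rebuilt, node[1]) if k == 0 else (node[0], rebuilt)

        return rebuild(dendro, target)

    def mcmc_step(dendro, edges):
        """One Metropolis-Hastings step: accept an improvement outright,
        otherwise accept with probability L(D')/L(D) = exp(delta)."""
        proposal = propose(dendro)
        delta = log_likelihood(proposal, edges) - log_likelihood(dendro, edges)
        return proposal if delta >= 0 or random.random() < exp(delta) else dendro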

The Markov chain appears to converge relatively quickly, with the likelihood reaching a plateau after roughly $O(n^2)$ steps. This is not a rigorous performance guarantee, however, and indeed there are mathematical results for similar Markov chains suggesting that equilibration could take exponential time in the worst case [E. Mossel and E. Vigoda, "Phylogenetic MCMC Algorithms Are Misleading on Mixtures of Trees," Science 309, 2207 (2005)]. Still, as our results here show, the method seems to work quite well in practice. The algorithm is able to handle networks with up to a few thousand vertices in a reasonable amount of computer time.

We find that there are typically many dendrograms with roughly equal likelihoods, which reinforces our contention that it is important to sample the distribution of dendrograms rather than merely focusing on the most likely one.

Appendix C Resampling from the hierarchical random graph

The procedure for resampling from the hierarchical random graph is as follows.

  1. Initialize the Markov chain by choosing a random starting dendrogram.

  2. Run the Monte Carlo algorithm until equilibrium is reached.

  3. Sample dendrograms at regular intervals thereafter from those generated by the Markov chain.

4. For each sampled dendrogram $D$, create a resampled graph $G'$ with $n$ vertices by placing an edge between each of the $n(n-1)/2$ vertex pairs $(i,j)$ with independent probability $p_{ij} = \bar{p}_r$, where $r$ is the lowest common ancestor of $i$ and $j$ in $D$ and $\bar{p}_r$ is given by Eq. (2). (In principle, there is nothing to prevent us from generating many resampled graphs from a dendrogram, but in the calculations described in this paper we generate only one from each dendrogram.)

After generating many samples in this way, we can compute averages of network statistics such as the degree distribution, the clustering coefficient, the vertex-vertex distance distribution, and so forth. Thus, in a way similar to Bayesian model averaging [T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York (2001)], we can estimate the distribution of network statistics defined by the equilibrium ensemble of dendrograms.
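In Python, step 4 can exploit the fact that every pair with one endpoint in each of an internal node's two subtrees has that node as its lowest common ancestor, so a single bottom-up pass places all edges (a sketch in the same nested-pair representation as above):

    import random

    def resample(dendro, edges):
        """Generate one resampled graph G': each pair (i, j) is connected
        independently with probability pbar_r (Eq. (2)) of its lowest
        common ancestor r in the dendrogram."""
        new_edges = set()

        def walk(node):
            """Return the subtree's leaf set, placing cross-subtree edges."""
            if not isinstance(node, tuple):
                return {node}
            lset, rset = walk(node[0]), walk(node[1])
            E = sum((i in lset and j in rset) or (j in lset and i in rset)
                    for i, j in edges)
            pbar = E / (len(lset) * len(rset))   # ML estimate at this node
            for i in lset:                       # pairs whose LCA is here
                for j in rset:
                    if random.random() < pbar:
                        new_edges.add((min(i, j), max(i, j)))
            return lset | rset

        walk(dendro)
        return new_edges

Averaging a statistic such as the degree distribution over many graphs generated this way produces comparisons like those of Fig. S3.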

Figure S4: Further comparison of link prediction algorithms. Data points represent the average ratio between the probability that the top-ranked pair of vertices is in fact connected and the corresponding probability for a randomly chosen pair, as a function of the fraction of the connections known to the algorithm. For each network (a, Terrorist associations; b, T. pallidum metabolites; and c, Grassland species interactions), we compare our method with simpler methods such as guessing that two vertices are connected if they share common neighbors, have a high degree product, or have a short path between them.

For the construction of consensus dendrograms such as the one shown in Fig. 2a, we found it useful to weight the most likely dendrograms more heavily, giving them weight proportional to the square of their likelihood, in order to extract a coherent consensus structure from the equilibrium set of models.

Appendix D Predicting missing connections

Our algorithm for using hierarchical random graphs to predict missing connections is as follows.

  1. Initialize the Markov chain by choosing a random starting dendrogram.

  2. Run the Monte Carlo algorithm until equilibrium is reached.

  3. Sample dendrograms at regular intervals thereafter from those generated by the Markov chain.

4. For each pair of vertices $i,j$ for which there is not already a known connection, calculate the mean probability $\langle p_{ij} \rangle$ that they are connected by averaging over the corresponding probabilities $p_{ij}$ in each of the sampled dendrograms $D$.

5. Sort these pairs $i,j$ in decreasing order of $\langle p_{ij} \rangle$ and predict that the highest-ranked ones have missing connections.

In general, we find that the top 1% of such predictions are highly accurate. However, for large networks, even the top 1% can be an unreasonably large number of candidates to check experimentally. In many contexts, researchers may want to consider using the procedure interactively, i.e., predicting a small number of missing connections, checking them experimentally, adding the results to the network, and running the algorithm again to predict additional connections.
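The procedure above might look as follows in Python (a minimal sketch in the same nested-pair representation as the earlier code; names are ours): it accumulates $\bar{p}_r$ at each pair's lowest common ancestor across the sampled dendrograms, then ranks the unobserved pairs by the average.

    from collections import defaultdict

    def predict_missing(sampled_dendros, edges):
        """Rank unconnected pairs by <p_ij>, the mean over sampled
        dendrograms of pbar_r at the pair's lowest common ancestor."""
        totals = defaultdict(float)

        def walk(node):
            if not isinstance(node, tuple):
                return {node}
            lset, rset = walk(node[0]), walk(node[1])
            E = sum((i in lset and j in rset) or (j in lset and i in rset)
                    for i, j in edges)
            pbar = E / (len(lset) * len(rset))
            for i in lset:                     # pairs whose LCA is this node
                for j in rset:
                    totals[(min(i, j), max(i, j))] += pbar
            return lset | rset

        for dendro in sampled_dendros:
            walk(dendro)
        known = {(min(i, j), max(i, j)) for i, j in edges}
        candidates = [(total / len(sampled_dendros), pair)
                      for pair, total in totals.items() if pair not in known]
        candidates.sort(reverse=True)          # highest <p_ij> first
        return candidates                      # predicted missing connections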

The alternative prediction methods we compared against, which were investigated previously by D. Liben-Nowell and J. Kleinberg ["The link prediction problem for social networks," Proc. Internat. Conf. on Info. and Know. Manage. (2003)], consist of giving each pair of vertices a score, sorting pairs in decreasing order of their score, and predicting that those with the highest scores are the most likely to be connected. Several different types of scores were investigated, defined as follows, where $\Gamma(i)$ is the set of vertices connected to $i$; a code sketch follows the list.

1. Common neighbors: $\mathrm{score}(i,j) = |\Gamma(i) \cap \Gamma(j)|$, the number of common neighbors of vertices $i$ and $j$.

2. Jaccard coefficient: $\mathrm{score}(i,j) = |\Gamma(i) \cap \Gamma(j)| \,/\, |\Gamma(i) \cup \Gamma(j)|$, the fraction of all neighbors of $i$ and $j$ that are neighbors of both.

3. Degree product: $\mathrm{score}(i,j) = |\Gamma(i)| \cdot |\Gamma(j)|$, the product of the degrees of $i$ and $j$.

4. Short paths: $\mathrm{score}(i,j)$ is $1$ divided by the length of the shortest path through the network from $i$ to $j$ (or zero for vertex pairs that are not connected by any path).
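All four scores are straightforward to implement; the sketch below is our own, with the breadth-first search for shortest paths an added detail not specified in the text.

    from collections import defaultdict
    from itertools import combinations

    def neighbor_sets(edges):
        """Gamma(i): the set of vertices connected to i."""
        gamma = defaultdict(set)
        for i, j in edges:
            gamma[i].add(j)
            gamma[j].add(i)
        return gamma

    def inverse_path_length(gamma, i, j):
        """1 / (shortest-path length from i to j), or 0 if no path exists."""
        frontier, seen, d = {i}, {i}, 0
        while frontier:
            d += 1
            frontier = {k for v in frontier for k in gamma[v]} - seen
            if j in frontier:
                return 1.0 / d
            seen |= frontier
        return 0.0

    def heuristic_scores(edges):
        """The four scores of the list above, for every vertex pair."""
        gamma = neighbor_sets(edges)
        scores = {}
        for i, j in combinations(sorted(gamma), 2):
            common, union = gamma[i] & gamma[j], gamma[i] | gamma[j]
            scores[(i, j)] = {
                'common neighbors': len(common),
                'Jaccard': len(common) / len(union),
                'degree product': len(gamma[i]) * len(gamma[j]),
                'short paths': inverse_path_length(gamma, i, j),
            }
        return scores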

One way to quantify the success of a prediction method, used by previous authors who have studied link prediction problems, is the ratio between the probability that the top-ranked pair is connected and the probability that a randomly chosen pair of vertices with no observed connection between them is connected. Figure S4 shows the average value of this ratio as a function of the percentage of the network shown to the algorithm, for each of our three networks. Even when a large fraction of the network is missing, our method predicts missing connections about ten times better than chance for all three networks. In practical terms, this means that the amount of work required of the experimenter to discover a new connection is reduced by a factor of ten, an enormous improvement by any standard. If a greater fraction of the network is known, the accuracy is greater still, rising even further above chance when only a few connections are missing.

We note, however, that using this ratio to judge prediction algorithms has an important disadvantage. Some missing connections are much easier to predict than others: for instance, if a network has a heavy-tailed degree distribution and we remove a randomly chosen subset of the edges, the chances are excellent that two high-degree vertices will have a missing connection and such a connection can be easily predicted by simple heuristics such as those discussed above. The AUC statistic used in the text, by contrast, looks at an algorithm’s overall ability to rank all the missing connections over nonexistent ones, not just those that are easiest to predict.
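For completeness, here is a minimal Monte Carlo estimate of the AUC under its standard definition; the argument names and sampling scheme are illustrative assumptions, not taken from the paper.

    import random

    def auc(score, missing_pairs, nonexistent_pairs, trials=10000):
        """Estimate the probability that a randomly chosen true missing
        connection outscores a randomly chosen pair with no connection;
        ties count one half. 0.5 is chance performance, 1.0 is perfect."""
        total = 0.0
        for _ in range(trials):
            m = score[random.choice(missing_pairs)]      # a removed true edge
            n = score[random.choice(nonexistent_pairs)]  # a true non-edge
            total += 1.0 if m > n else (0.5 if m == n else 0.0)
        return total / trials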

Finally, we have investigated the performance of each of the prediction algorithms on purely random (i.e., Erdős–Rényi) graphs. As expected, no method performs better than chance in this case, since the connections are completely independent random events and there is no structure to discover. We also tested each algorithm on a graph with a power-law degree distribution generated according to the configuration model [M. Molloy and B. Reed, "A critical point for random graphs with a given degree sequence," Random Structures and Algorithms 6, 161–179 (1995)]. In this case, guessing that high-degree vertices are likely to be connected performs quite well, whereas the method based on the hierarchical random graph performs poorly, since these graphs have no hierarchical structure to discover.