References
 (1) S. Wasserman, K. Faust, Social Network Analysis (Cambridge University Press, Cambridge, 1994).
 (2) R. Albert, A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
 (3) M. E. J. Newman, SIAM Review 45, 167 (2003).
 (4) E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, A.-L. Barabási, Science 297, 1551 (2002).
 (5) A. Clauset, M. E. J. Newman, C. Moore, Phys. Rev. E 70, 066111 (2004).
 (6) R. Guimera, L. A. N. Amaral, Nature 433, 895 (2005).
 (7) M. C. Lagomarsino, P. Jona, B. Bassetti, H. Isambert, Proc. Natl. Acad. Sci. USA 104, 5516 (2001).
 (8) D. LibenNowell, J. Kleinberg, International Conference on Information and Knowledge Management (2003). It should be noted that this paper focuses on predicting future connections given current ones, as opposed to predicting connections which exist but have not yet been observed.
 (9) M. Girvan, M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821 (2002).
 (10) A. E. Krause, K. A. Frank, D. M. Mason, R. E. Ulanowicz, W. W. Taylor, Nature 426, 282 (2003).
 (11) F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Proc. Natl. Acad. Sci. USA 101, 2658 (2004).
 (12) D. J. Watts, P. S. Dodds, M. E. J. Newman, Science 296, 1302 (2002).
 (13) J. Kleinberg, Proceedings of the 2001 Neural Information Processing Systems Conference, T. G. Dietterich, S. Becker, Z. Ghahramani, eds. (MIT Press, Cambridge, MA, 2002).
 (14) G. Palla, I. Derényi, I. Farkas, T. Vicsek, Nature 435, 814 (2005).
 (15) G. Casella, R. L. Berger, Statistical Inference (Duxbury Press, Belmont, 2001).
 (16) M. E. J. Newman, G. T. Barkema, Monte Carlo Methods in Statistical Physics (Clarendon Press, Oxford, 1999).
 (17) M. E. J. Newman, Phys. Rev. Lett. 89, 208701 (2002).
 (18) M. Huss, P. Holme, IET Systems Biology 1, 280 (2007).
 (19) V. Krebs, Connections 24, 43 (2002).
 (20) H. A. Dawah, B. A. Hawkins, M. F. Claridge, Journal of Animal Ecology 64, 708 (1995).
 (21) D. Bryant, BioConsensus, M. Janowitz, F.-J. Lapointe, F. R. McMorris, B. Mirkin, F. Roberts, eds. (DIMACS, 2003), pp. 163–184.
 (22) J. A. Dunne, R. J. Williams, N. D. Martinez, Proc. Natl. Acad. Sci. USA 99, 12917 (2002).
 (23) A. Szilágyi, V. Grimm, A. K. Arakaki, J. Skolnick, Physical Biology 2, S1 (2005).
 (24) E. Sprinzak, S. Sattath, H. Margalit, J. Mol. Biol. 327, 919 (2003).
 (25) T. Ito, et al., Proc. Natl. Acad. Sci. USA 98, 4569 (2001).
 (26) A. Lakhina, J. W. Byers, M. Crovella, P. Xie, Proc. INFOCOM (2003).
 (27) A. Clauset, C. Moore, Phys. Rev. Lett. 94, 018701 (2005).
 (28) N. D. Martinez, B. A. Hawkins, H. A. Dawah, B. P. Feifarek, Ecology 80, 1044 (1999).
 (29) J. A. Hanley, B. J. McNeil, Radiology 143, 29 (1982).
 (30) M. Sales-Pardo, R. Guimerà, A. A. Moreira, L. A. N. Amaral, Proc. Natl. Acad. Sci. USA 104, 15224 (2007).
Supplementary Information
Appendix A Hierarchical random graphs
Our model for the hierarchical organization of a network is as follows. (Computer code implementing many of the analysis methods described in this paper can be found online at www.santafe.edu/aaronc/randomgraphs/.)
Let G be a graph with n vertices. A dendrogram D is a binary tree with n leaves corresponding to the vertices of G. Each of the n - 1 internal nodes of D corresponds to the group of vertices that are descended from it. We associate a probability p_r with each internal node r. Then, given two vertices i, j of G, the probability p_{ij} that they are connected by an edge is p_{ij} = p_r, where r is their lowest common ancestor in D. The combination (D, \{p_r\}) of the dendrogram and the set of probabilities then defines a hierarchical random graph.
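As a concrete illustration of the model, here is a minimal sketch in Python (our own illustration, not the authors' released code; the class and function names are hypothetical). It stores a dendrogram whose internal nodes carry probabilities p_r and draws one graph by connecting each vertex pair with the probability found at its lowest common ancestor:

    import random

    class Node:
        """Internal dendrogram node carrying a connection probability p_r."""
        def __init__(self, left, right, p):
            self.left, self.right, self.p = left, right, p
            # Precompute the vertex labels beneath each child, so that
            # lowest-common-ancestor lookups become simple set tests.
            self.left_leaves = leaves(left)
            self.right_leaves = leaves(right)

    def leaves(node):
        """Vertex labels in the subtree rooted at node (a leaf is a bare label)."""
        if isinstance(node, Node):
            return node.left_leaves | node.right_leaves
        return {node}

    def lca_probability(root, i, j):
        """Descend from the root to the lowest common ancestor of i and j
        and return its probability p_r (assumes i != j)."""
        node = root
        while True:
            if i in node.left_leaves and j in node.left_leaves:
                node = node.left
            elif i in node.right_leaves and j in node.right_leaves:
                node = node.right
            else:
                return node.p  # i and j part company here: this is the LCA

    def sample_graph(root, vertices):
        """Draw one hierarchical random graph: each pair i, j is linked
        independently with probability p_r at its lowest common ancestor."""
        edges = set()
        for a in range(len(vertices)):
            for b in range(a + 1, len(vertices)):
                i, j = vertices[a], vertices[b]
                if random.random() < lca_probability(root, i, j):
                    edges.add((i, j))
        return edges

    # Example: two tight pairs {0,1} and {2,3}, sparsely linked across.
    dendrogram = Node(Node(0, 1, 0.9), Node(2, 3, 0.9), 0.1)
    print(sample_graph(dendrogram, [0, 1, 2, 3]))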
Note that if a community has, say, three subcommunities with an equal probability of connections between them, we can represent this in our model by first splitting one of these subcommunities off and then splitting the other two. The two internal nodes corresponding to these splits would be given the same probability p_r. This yields three possible binary dendrograms, which are all considered equally likely.
We can think of the hierarchical random graph as a variation on the classical Erdős–Rényi random graph G(n, p). As in that model, the presence or absence of an edge between any pair of vertices is independent of the presence or absence of any other edge. However, whereas in G(n, p) every pair of vertices has the same probability p of being connected, in the hierarchical random graph the probabilities are inhomogeneous, with the inhomogeneities controlled by the topological structure of the dendrogram D and the parameters \{p_r\}. Many other models with inhomogeneous edge probabilities have, of course, been studied in the past. One example is a structured random graph in which there are a finite number of types of vertices with a matrix giving the connection probabilities between them. (See F. McSherry, "Spectral Partitioning of Random Graphs," Proc. Foundations of Computer Science (FOCS), pp. 529–537 (2001).)
Appendix B Fitting the hierarchical random graph to data
Now we turn to the question of finding the hierarchical random graph or graphs that best fit the observed real-world network G. Assuming that all hierarchical random graphs are a priori equally likely, the probability that a given model (D, \{p_r\}) is the correct explanation of the data is, by Bayes' theorem, proportional to the posterior probability or likelihood \mathcal{L} with which that model generates the observed network. (See G. Casella and R. L. Berger, "Statistical Inference," Duxbury Press, Belmont (2001).) Our goal is to maximize \mathcal{L} or, more generally, to sample the space of all models with probability proportional to \mathcal{L}.
Let E_r be the number of edges in G whose endpoints have r as their lowest common ancestor in D, and let L_r and R_r, respectively, be the numbers of leaves in the left and right subtrees rooted at r. Then the likelihood of the hierarchical random graph is
    \mathcal{L}(D, \{p_r\}) = \prod_r p_r^{E_r} (1 - p_r)^{L_r R_r - E_r},    (1)
with the convention that 0^0 = 1.
If we fix the dendrogram D, it is easy to find the probabilities \{p_r\} that maximize \mathcal{L}(D, \{p_r\}). For each r, they are given by
    \bar{p}_r = \frac{E_r}{L_r R_r},    (2)
the fraction of potential edges between the two subtrees of r that actually appear in the graph G. The likelihood of the dendrogram evaluated at this maximum is then
    \mathcal{L}(D) = \prod_r \left[ \bar{p}_r^{\bar{p}_r} (1 - \bar{p}_r)^{1 - \bar{p}_r} \right]^{L_r R_r}.    (3)
Figure S1 shows an illustrative example, consisting of a network with six vertices.
It is often convenient to work with the logarithm of the likelihood,
    \log \mathcal{L}(D) = -\sum_r L_r R_r \, h(\bar{p}_r),    (4)
where h(p) = -p \log p - (1 - p) \log(1 - p) is the Gibbs–Shannon entropy function. Note that each term -L_r R_r h(\bar{p}_r) is maximized when \bar{p}_r is close to 0 or to 1, i.e., when the entropy is minimized. In other words, high-likelihood dendrograms are those that partition the vertices into groups between which connections are either very common or very rare.
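In code, Eqs. (2) and (4) amount to only a few lines. The following sketch (ours, with hypothetical names) computes \log \mathcal{L}(D) given, for each internal node r, the quantities E_r, L_r, and R_r:

    from math import log

    def entropy(p):
        """Gibbs-Shannon entropy h(p) = -p log p - (1-p) log(1-p),
        with 0 log 0 taken to be 0 (the 0^0 = 1 convention above)."""
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * log(p) - (1.0 - p) * log(1.0 - p)

    def log_likelihood(internal_nodes):
        """log L(D) of Eq. (4).  Each entry of internal_nodes is a tuple
        (E_r, L_r, R_r) for one internal node r; p_r is set to its
        maximum-likelihood value E_r / (L_r R_r) from Eq. (2)."""
        total = 0.0
        for E, L, R in internal_nodes:
            p = E / (L * R)              # Eq. (2)
            total -= L * R * entropy(p)  # one term of Eq. (4)
        return total

    # A 4-vertex dendrogram has 3 internal nodes; here the root splits
    # 2+2 leaves with one cross edge, and each pair below is connected.
    print(log_likelihood([(1, 2, 2), (1, 1, 1), (1, 1, 1)]))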
We now use a Markov chain Monte Carlo method to sample dendrograms D with probability proportional to their likelihood \mathcal{L}(D). To create the Markov chain we need to pick a set of transitions between possible dendrograms. The transitions we use consist of rearrangements of subtrees of the dendrogram, as follows. First, note that each internal node r of a dendrogram D is associated with three subtrees: the subtrees descended from its two daughters, and the subtree descended from its sibling. As Figure S2 shows, there are two ways we can reorder these subtrees without disturbing any of their internal relationships. Each step of our Markov chain consists first of choosing an internal node r uniformly at random (other than the root) and then choosing uniformly at random between the two alternate configurations of the subtrees associated with that node and adopting that configuration. The result is a new dendrogram D'. It is straightforward to show that transitions of this type are ergodic, i.e., that any pair of finite dendrograms can be connected by a finite series of such transitions.
Once we have generated our new dendrogram D', we accept or reject it according to the standard Metropolis–Hastings rule. (See M. E. J. Newman and G. T. Barkema, "Monte Carlo Methods in Statistical Physics," Clarendon Press, Oxford (1999).) Specifically, we accept the transition if \Delta \log \mathcal{L} = \log \mathcal{L}(D') - \log \mathcal{L}(D) is nonnegative, so that D' is at least as likely as D; otherwise we accept the transition with probability \mathcal{L}(D')/\mathcal{L}(D) = e^{\Delta \log \mathcal{L}}.
If the transition is not accepted, the dendrogram remains the same on this step of the chain. The Metropolis–Hastings rule ensures detailed balance and, in combination with the ergodicity of the transitions, guarantees a limiting probability distribution over dendrograms that is proportional to the likelihood, \mathcal{L}(D). The quantity \Delta \log \mathcal{L} can be calculated easily, since the only terms in Eq. (4) that change from D to D' are those involving the three subtrees associated with the chosen node r.
The Markov chain appears to converge relatively quickly, with the likelihood reaching a plateau after roughly O(n^2) steps. This is not a rigorous performance guarantee, however, and indeed there are mathematical results for similar Markov chains that suggest that equilibration could take exponential time in the worst case. (See E. Mossel and E. Vigoda, "Phylogenetic MCMC Algorithms Are Misleading on Mixtures of Trees," Science 309, 2207 (2005).) Still, as our results here show, the method seems to work quite well in practice. The algorithm is able to handle networks with up to a few thousand vertices in a reasonable amount of computer time.
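The acceptance step can be sketched as follows (an illustration under the assumption that a `propose` function implements the subtree rearrangement of Figure S2; none of these names come from the authors' published code):

    import math
    import random

    def mcmc_step(current, log_L, propose):
        """One Metropolis-Hastings step over dendrograms.  log_L maps a
        dendrogram to its log-likelihood, Eq. (4); propose is assumed to
        choose a non-root internal node uniformly at random and apply one
        of the two alternate subtree configurations of Figure S2."""
        proposal = propose(current)
        delta = log_L(proposal) - log_L(current)
        # Accept if D' is at least as likely as D; otherwise accept with
        # probability L(D')/L(D) = exp(delta).
        if delta >= 0 or random.random() < math.exp(delta):
            return proposal
        return current  # rejected: the chain stays put on this step

In an efficient implementation, delta would be computed locally from the few terms of Eq. (4) that involve the chosen node, rather than by re-evaluating the whole sum.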
We find that there are typically many dendrograms with roughly equal likelihoods, which reinforces our contention that it is important to sample the distribution of dendrograms rather than merely focusing on the most likely one.
Appendix C Resampling from the hierarchical random graph
The procedure for resampling from the hierarchical random graph is as follows.
1. Initialize the Markov chain by choosing a random starting dendrogram.
2. Run the Monte Carlo algorithm until equilibrium is reached.
3. Sample dendrograms at regular intervals thereafter from those generated by the Markov chain.
4. For each sampled dendrogram D, create a resampled graph G' with n vertices by placing an edge between each of the n(n-1)/2 vertex pairs (i, j) with independent probability p_{ij} = \bar{p}_r, where r is the lowest common ancestor of i and j in D and \bar{p}_r is given by Eq. (2). (In principle, there is nothing to prevent us from generating many resampled graphs from a dendrogram, but in the calculations described in this paper we generate only one from each dendrogram.) The full loop is sketched after this list.
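Putting the four steps together (again our own illustration: `mcmc_step` and `sample_graph` are the sketches above, `vertices_of` is a hypothetical helper returning the leaf labels of a dendrogram, and the burn-in and interval values are arbitrary placeholders, not tuned values from the paper):

    def resample_average(start, log_L, propose, vertices_of, statistic,
                         burn_in=10000, interval=100, n_samples=100):
        """Steps 1-4 above: equilibrate, sample dendrograms at regular
        intervals, draw one resampled graph per sample, and average a
        network statistic over the resulting ensemble."""
        d = start
        for _ in range(burn_in):                 # step 2: reach equilibrium
            d = mcmc_step(d, log_L, propose)
        values = []
        for _ in range(n_samples):               # step 3: sample the chain
            for _ in range(interval):
                d = mcmc_step(d, log_L, propose)
            g = sample_graph(d, vertices_of(d))  # step 4: one graph per sample
            values.append(statistic(g))
        return sum(values) / len(values)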
After generating many samples in this way, we can compute averages of network statistics such as the degree distribution, the clustering coefficient, the vertex–vertex distance distribution, and so forth. Thus, in a way similar to Bayesian model averaging (see T. Hastie, R. Tibshirani, and J. Friedman, "The Elements of Statistical Learning," Springer, New York (2001)), we can estimate the distribution of network statistics defined by the equilibrium ensemble of dendrograms.
For the construction of consensus dendrograms such as the one shown in Fig. 2a, we found it useful to weight the most likely dendrograms more heavily, giving them weight proportional to the square of their likelihood, in order to extract a coherent consensus structure from the equilibrium set of models.
Appendix D Predicting missing connections
Our algorithm for using hierarchical random graphs to predict missing connections is as follows.
1. Initialize the Markov chain by choosing a random starting dendrogram.
2. Run the Monte Carlo algorithm until equilibrium is reached.
3. Sample dendrograms at regular intervals thereafter from those generated by the Markov chain.
4. For each pair of vertices i, j for which there is not already a known connection, calculate the mean probability \langle p_{ij} \rangle that they are connected, by averaging over the corresponding probabilities p_{ij} in each of the sampled dendrograms D.
5. Sort these pairs in decreasing order of \langle p_{ij} \rangle and predict that the highest-ranked ones have missing connections. (Steps 4 and 5 are sketched in code after this list.)
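A compact sketch of steps 4 and 5 (illustrative only; `lca_probability` is the lookup from the first sketch, and `sampled` is the list of dendrograms produced in step 3):

    from itertools import combinations

    def rank_missing_links(vertices, known_edges, sampled):
        """Steps 4 and 5: average p_ij over the sampled dendrograms for
        every pair with no observed connection, then sort the pairs in
        decreasing order of the mean probability."""
        known = {frozenset(e) for e in known_edges}
        scores = {}
        for i, j in combinations(vertices, 2):
            if frozenset((i, j)) in known:
                continue  # only score pairs without an observed connection
            mean_p = sum(lca_probability(d, i, j) for d in sampled) / len(sampled)
            scores[(i, j)] = mean_p
        # The top of this ranking is the prediction for missing connections.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)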
In general, we find that the top 1% of such predictions are highly accurate. However, for large networks, even the top 1% can be an unreasonably large number of candidates to check experimentally. In many contexts, researchers may want to consider using the procedure interactively, i.e., predicting a small number of missing connections, checking them experimentally, adding the results to the network, and running the algorithm again to predict additional connections.
The alternative prediction methods we compared against, which were previously investigated by D. Liben-Nowell and J. Kleinberg ("The link prediction problem for social networks," Proc. Internat. Conf. on Info. and Know. Manage. (2003)), consist of giving each pair of vertices a score, sorting pairs in decreasing order of their score, and predicting that those with the highest scores are the most likely to be connected. Several different types of scores were investigated, defined as follows (implementations are sketched after the list), where \Gamma(i) is the set of vertices connected to vertex i.

Common neighbors: score(i, j) = |\Gamma(i) \cap \Gamma(j)|, the number of common neighbors of vertices i and j.
Jaccard coefficient: score(i, j) = |\Gamma(i) \cap \Gamma(j)| / |\Gamma(i) \cup \Gamma(j)|, the fraction of all neighbors of i and j that are neighbors of both.
Degree product: score(i, j) = |\Gamma(i)| \cdot |\Gamma(j)|, the product of the degrees of i and j.
Short paths: score(i, j) is 1 divided by the length of the shortest path through the network from i to j (or zero for vertex pairs that are not connected by any path).
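For concreteness, the four scores can be written directly in terms of the neighbor sets. In the sketch below (ours), `nbrs` is assumed to be a dictionary mapping each vertex to the set Γ of its neighbors:

    from collections import deque

    def common_neighbors(nbrs, i, j):
        """|Γ(i) ∩ Γ(j)|: the number of shared neighbors."""
        return len(nbrs[i] & nbrs[j])

    def jaccard(nbrs, i, j):
        """|Γ(i) ∩ Γ(j)| / |Γ(i) ∪ Γ(j)|."""
        union = nbrs[i] | nbrs[j]
        return len(nbrs[i] & nbrs[j]) / len(union) if union else 0.0

    def degree_product(nbrs, i, j):
        """|Γ(i)| x |Γ(j)|: the product of the two degrees."""
        return len(nbrs[i]) * len(nbrs[j])

    def short_path_score(nbrs, i, j):
        """1 / (shortest path length from i to j), found by breadth-first
        search, or 0 if no path exists.  Assumes i != j."""
        seen, queue = {i}, deque([(i, 0)])
        while queue:
            v, d = queue.popleft()
            if v == j:
                return 1.0 / d
            for w in nbrs[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append((w, d + 1))
        return 0.0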
One way to quantify the success of a prediction method, used by previous authors who have studied link prediction problems, is the ratio between the probability that the top-ranked pair is connected and the probability that a randomly chosen pair of vertices, with no observed connection between them, is connected. Figure S4 shows the average value of this ratio as a function of the percentage of the network shown to the algorithm, for each of our three networks. Even when a large fraction of the network is missing, our method predicts missing connections about ten times better than chance for all three networks. In practical terms, this means that the amount of work required of the experimenter to discover a new connection is reduced by a factor of ten, an enormous improvement by any standard. If a greater fraction of the network is known, the accuracy becomes even greater, rising still further above chance when only a few connections are missing.
We note, however, that using this ratio to judge prediction algorithms has an important disadvantage. Some missing connections are much easier to predict than others: for instance, if a network has a heavy-tailed degree distribution and we remove a randomly chosen subset of the edges, the chances are excellent that two high-degree vertices will have a missing connection, and such a connection can easily be predicted by simple heuristics such as those discussed above. The AUC statistic used in the text, by contrast, measures an algorithm's overall ability to rank all of the missing connections above nonexistent ones, not just those that are easiest to predict.
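The AUC statistic itself has a simple sampling estimate: repeatedly draw one true missing connection and one nonexistent connection at random and record how often the missing one receives the higher score. A sketch (ours) under the assumption that `score` is any of the predictors above:

    import random

    def auc(missing_pairs, nonexistent_pairs, score, trials=10000):
        """Monte Carlo estimate of the AUC: the probability that a
        randomly chosen true missing connection is scored higher than a
        randomly chosen nonexistent one.  Ties count one half; 0.5 is
        chance performance."""
        wins = 0.0
        for _ in range(trials):
            m = score(random.choice(missing_pairs))
            n = score(random.choice(nonexistent_pairs))
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
        return wins / trials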
Finally, we have investigated the performance of each of the prediction algorithms on purely random (i.e., Erdős–Rényi) graphs. As expected, no method performs better than chance in this case, since the connections are completely independent random events and there is no structure to discover. We also tested each algorithm on a graph with a power-law degree distribution generated according to the configuration model. (See M. Molloy and B. Reed, "A critical point for random graphs with a given degree sequence," Random Structures and Algorithms 6, 161–179 (1995).) In this case, guessing that high-degree vertices are likely to be connected performs quite well, whereas the method based on the hierarchical random graph performs poorly, since these graphs have no hierarchical structure to discover.