1 Introduction
How do the social communities and their selforganization arise from coordinated interactions and information sharing among the actors? One way to tap into this question is to understand the latent roles and minds of actors which lead to the formation and organization of these communities. We are particularly interested in uncovering the functional/sociological underpinning of network actors, and their influence on the network modularity and hierarchy.
Existing methods for inferring actor roles [10, 16, 9] are limited in that they assume each actor undertakes a single, invariant role when interacting with other actors. This is in stark constrast with real social networks where actors can play multiple roles (or be under multiple cultural influences); the specific role being played depends on whom the actor is interacting with. The Mixed membership stochastic blockmodel (MMSB) [2] proposed recently captures the multifaceted, contextspecific nature of an actor’s role.
However, the presence of hierarchical structures in realworld social networks raises a significant challenge to current network models. The MMSB formalism described above is intrinsically flat in its structure — all roles of an actor are equal citizens in terms of their relationships to each other — so it does not induce hierarchical structures among actors. A hierarchical structure goes beyond simple clustering by explicitly including organization at all scales in a network simultaneously. Taking a corporate email network as example (Figure 1), suppose a finance department employee in the European branch is communicating with an American branch executive. The European employee might not communicate with a (possibly businessoriented) role related to her membership in the European finance department (which normally has no business with the American finance department), but simply with a more generic social role as a member of the European branch of the same company (which has occasional acquaintance with the American colleagues). However, when this same employee interacts with a European human resource employee, she would likely do so with her role as a European finance department member. This notion of interactionspecific, multicommunity membership accounts for otherwise unusual edges — like the interaction between the European finance employee and the American executive — and motivates an intuitive yet powerful approach that allows one to infer and visualize the semantic underpinnings of every actor and every link between actors.
We propose a treestructured hierarchical MMSB (hMMSB) to model communityhierarchies and the resultant social network as a function of the community memberships undertaken by actors during different interactions. Our method treats the community structure as a hierarchy of super and subcommunities, while accounting for the interactionspecific nature of community membership. Moreover, it automatically determines the appropriate number of subcommunities in each community. This sets our method apart from spectral clustering methods, which typically do not produce hierarchies, or agglomerative clustering methods, which are inherently restricted to binary hierarchies. We will use our model to analyze a biological food web and a physics paper citation network, and examine whether the recovered hierarchies agree with the known organizational structure of both networks. Extensive simulationbased validation of both our model’s consistency and the Gibbssampling inference algorithm will be conducted to ensure our model’s feasibility for hierarchical structure identification.
1.1 Related Work
While there has been a great deal of work on graph clustering and community structure inference [7, 14, 18, 5, 8], probabilistic generative models for hierarchical community formation and latent multirole inference of every actor and link have just started to draw attention.
As mentioned earlier, the MMSB model [2] enables inference of the latent roles of every actor and link in a network, but it cannot capture hierarchical structures of possible communities in the network. The link prediction model in [17] employs an Indian Buffet Process prior over actor positions in an infinitedimensional latent feature space. In that respect, it may be thought of as a nonparametric extension of the MMSB. However, the goal of their model is missing link prediction rather than inference of latent organizational structure.
Recently, Roy et. al. have developed an “annotated hierarchies” model [19] that generalizes an infinite relational model [13] for hierarchical group discovery. Their subsequent work on Mondrian Processes [20] provides an alternative nonparametric take on the same issues. Both the annotated hierarchies model and the Mondrian Process relational model discover binary hierarchies. The infinite stochastic blockmodel [12] recovers organizational hierarchies of arbitrary branching factor, among other types of structure like grids or partitions, but it does not offer detailed semantic underpinnings, such as contextdependent role or community membership instantiations, of every network link.
2 Probabilistic Hierarchical Community Model
Under the MMSB [2], network links can be instantiated by a rolespecific local interaction mechanism: the link between each pair of actors, say , is instantiated according to the latent role specifically undertaken by actor when it is to interact with , and that of when it is to interact with
. Crucial to the goal of roleprediction for network data, is the socalled “role vector”
of mixed membership coefficients, which succinctly captures the probabilities of an actor being involved in different roles when he/she interacts with another actor. In this paper, we leverage and generalize the “role vector” to model contextdependent mixed memberships in communities of different scales when links between actors are formed. We propose a
hierarchical MMSB (hMMSB) model for nested community structures underlying social networks, thereupon such structures as well as the community implications of every link can be inferred, given the network. In the sequel, we begin with a baseline hMMSB that requires a user to prespecify the granularity of a tree hierarchy of communities, that is, the branching degree from each group to its subgroup at the next level. We will then relax this constraint by employing a nested Chinese Restaurant Process, described in [3], to allow hierarchies of arbitrary granularities with a fixed depth.Let denote a directed graph, where is the set of vertices/actors, and is the set of edges/interactions. We adopt the convention if else , and we ignore selfedges . Because the graph is directed, is not necessarily a symmetric matrix. Given an edge , we refer to actor as the donor and actor as the receiver.
Under an hMMSB, we assume that a link follows a generative process based on two latent variables unique to each actor. The first one is a community membershippath vector (or in short, path) , which records the community memberships of this actor at different levels of the community hierarchy in the network. Our model requires an integer parameter , denoting the maximum depth to which the hierarchy should be learnt. Given , we can represent an actor path as a vector of positive integers , where denotes the branch taken by the path at level . We adopt the convention that the hierarchy root sits at level 0. For clarity of explanation, we will assume for now that the actor paths are known. Later, we shall relax this assumption by placing a suitable nonparametric prior over , allowing them to be posteriorly inferred.
The second variable is a mixed membership vector (in short, MM) , defining the probabilities of an actor identifying himself at different community levels when interacting with other actors. We place a twoparameter stickbreaking prior [3] over , so that actors are not required to participate in the same number of community levels. Stick breaking constructions work as follows: Consider a stick of length 1. Draw . Let and let be the remainder of the stick after chopping off this length . To calculate the length , draw and chop off this fraction of the remainder of the stick, giving . Thus is the fraction to chop off from the stick’s remainder, and is the length of the stick that was chopped off. In general, we draw from to and the corresponding is defined below:
(1) 
This process is known as the twoparameter GEM distribution [3] and draws from are denoted as . influences the mean of , and
influences its variance. Because the hierarchy is only learnt up to depth
, we truncate the distribution at level . The stick breaking prior makes it more intuitive to bias interactions toward coarser or finer levels compared to a Dirichlet prior with either a single parameter (which is not expressive enough), or parameters (which may be too expressive).To instantiate the edge from actor to , we introduce interaction level indicators (donor level) and (receiver level), which follow multinomial distributions defined by the MM vectors respectively. The pair specifies actor ’s interactionspecific position in the community hierarchy, denoted by , which contributes to determining his interaction probability with actor . Note that we are not proposing a full hierarchical model for a tree over conditionally entities, as in a coalescent process [11]. Our goal is to model finitedepth nested communities of entities connected by a network; therefore our latent hierarchy is not a full binary tree over all nodes, and our generative process does not output nodal attributes, but links between nodes.
Now, given interaction level indicators and , and paths and of the two actors involved, we can determine the community identities (not necessarily at the same level) underlying an interaction. These community identities specify a communitypairspecific distribution for the link , where
(2)  
The notation denotes a parameter specifying the interaction probability from community to . We shall explain the intuition behind this in more detail below.
With all these details, we arrive at the following generative process of a network :

For each actor :

Sample actor ’s path from a distribution over hierarchies (detailed in the next section).

Sample actor ’s MM (distribution over community levels): .


For each element of in the communitycompatability matrix:

Sample a value for the element .


To generate the network, for each directed edge where , :

Sample the donor level .

Sample the receiver level .

Sample the interaction .

The overall intuition behind hMMSB is that and specify a hierarchy position for the donor, and likewise for the receiver; and we use these positions to determine the interaction probability. There are two different scenarios worth mentioning. First, suppose that the donor and receiver positions share the same immediate parent in the hierarchy. Our model associates a “compatibility matrix” with every parent position in the hierarchy, which gives the probability of interaction between any two of its child positions. The set of all compatibility matrices is denoted by . Since the donor and receiver positions share the same parent, we simply look up the appropriate entry in to get the interaction probability.
In the second case, the donor and receiver positions do not share the same parent; this always happens when (i.e. the positions are at different depths). We then coarsen both levels to their minimum, . The idea is that interactions between two hierarchy positions should take place at the level of the higher (coarser) position. For example, when an executive interacts with a sales representative, we reduce the interaction to one between an executive and a generic group of “other employees” below executive rank. (It is in theory possible to explicitly model all possible interactions between communities of arbitrary levels, but this will lead to a severe overparameterization of our model. We have decided to postpone a more thorough study of this issue to later work.) If the new positions now share the same parent, then we look up the appropriate element of as before. Otherwise, we define the interaction probability to be zero.
2.1 Infinite hMMSB
Thus far, we have assumed knowledge of the paths . From a combinatorial standpoint, inferring these paths is difficult. Because the number of children at each hierarchy position is unlimited, there are infinitely many possible hierarchies or sets of paths , even with the fixed depth
. One might consider using heuristic methods that guess at the number of such children, but doing so would defeat the purpose of employing a probabilistic model.
Our solution is to place an appropriate nonparametric Bayesian prior over , the Nested Chinese Restaurant Process [3], or nCRP for short. The nCRP is an extension of the regular Chinese Restaurant Process (CRP), a recursivelydefined prior over positive integers. For concreteness, we shall use the first level of each actor path, , to define the CRP:
(3) 
where is a “concentration” parameter that controls the probability of drawing new integers, and for conciseness we define . The nCRP is essentially a hierarchy of CRP priors, beginning with a single CRP prior at the top level. With each unique integer seen at the toplevel prior, we associate a child CRP prior with observations, resulting in a twolevel tree of CRP priors. We can repeat this process ad infinitum on the newlycreated child priors, resulting in an infintelevel tree of CRP priors, though we only use a level nCRP. All CRP priors in the nCRP share the same concentration parameter .
Now we can finish describing our generative process: for each actor , we can sample using the recursive nCRP definition:
(4) 
Figure 2 displays our complete generative process as a graphical model. As a side note, the infinite nCRP prior on paths implies that contains an infinite number of elements (there could be infinitely many children at each parent, so the compatibility matrices must be infinitedimensional). Our Gibbs sampler finesses this issue by integrating out , while the posterior distribution of each can be recovered from the path and level samples.
3 Collapsed Gibbs Sampler
Exact inference on our model is intractable, so we derive a collapsed Gibbs sampling scheme for posterior inference. The ’s and ’s are integrated out for faster mixing, so we only have to sample and . The sampling equations are provided below.
Sampling levels
The distribution of conditioned on all other variables is
(5)  
where is the set of all edges except , and . By BetaBernoulli conjugacy, the first term, for a particular value of , is
First term  
(6) 
where we have defined the shorthand .
The second term can be computed by iterated expectation, conditioning on actor ’s stick breaking lengths ,…,:
(7)  
Since we have limited the maximum depth to , we simply ignore the event , and renormalize the distribution of over the domain . The conditional distribution of is derived in similar fashion.
Sampling paths
The distribution of conditioned on all other variables is
(8)  
where is the set of all edges whose distributions depend on , and is its complement. The second term can be computed using the recursive nCRP definition:
(9) 
while the first term, for a particular value of , is
First term  
(10) 
where is the set of all associated with some edge in through .
4 Simulation
We evaluate our model’s ability to recover hierarchies on simulated data. For all simulations, the number of actors was 150, the max depth was (2 levels plus root) and for all actors, meaning that actors interact at level 1 25% of the time and level 2 75% of the time. Our experiments explore the effect of different compatibility matrices . We first explore an ondiagonal , where the diagonal elements are much larger than the offdiagonal elements (actors tend to interact within their own communities). We also investigate an offdiagonal , where the offdiagonal elements are larger (actors tend to interact outside their own communities). In the “low noise” setting the offdiagonal and ondiagonal elements are far apart (to clearly distinguish them), while in the “high noise” setting they are closer together. The exact parameters for the 4 types of ’s are shown below:

ondiagonal, low noise  , ;

ondiagonal, high noise  , ;

offdiagonal, low noise  , ;

offdiagonal, high noise  , .
means that actors interacting in the same level1 community do so with probability , while actors interacting in the same level2 community do so with probability (and analogously for ).
We compare our approach to hierarchical spectral clustering (denoted HSpectral). For spectral clustering, it is unclear how the number of clusters at each node would be selected, so we give it the number of 1stlevel branches as an advantage (and then let it do a binary split at each level2 node). For hMMSB, we fix and search over , picking the value that maximizes the marginal likelihood.
The Gibbs sampler was run for 1,500 iterations on each experiment. We calculate the F1 score at each level , where , and . is true positive count (actors that are in the same cluster and should be up to depth k), is false positive count, is true negative count, and is false negative count. The total F1 score is computed by averaging the scores for each level.
Figure 3 illustrates the results as a function of the number of branches at the first level of the generated tree. The “number of branches at level 1” refers to number of branches of size (since there are often branches of size 1 or 2). For fairness, the “correct” number of level1 branches given to spectral clustering is also the number of branches of size .
As one can see, in Figure 3 and Figure 3, when is strongly ondiagonal, our algorithm performs well, but a little worse than HSpectral (since HSpectral is given the number of level1 branches). A specific example is shown in Figures 3, 3, 3, 3 where both models perform reasonably well. However, when is strongly offdiagonal (implying that actors tend to interact with those in different communities and rather than within their own community), HSpectral performs poorly. Our method still gives good results (Figure 3, Figure 3). An example is shown in Figures 3, 3, 3, 3 where our model performs accurately while HSpectral essentially divides the actors randomly and performs poorly. Thus HMMSB successfully models a variety of community interactions that traditional clustering methods cannot. It can also recover the actorspecific interaction levels for a richer network analysis.
5 Heldout Evaluation
We compare our algorithm to MMSB [2] on two realworld datasets, a 75species food web of grassfeeding wasps [6, 4], and the September 11th, 2001 hijacker terrorist network [15, 4]. Our choices reflect two common modes of interaction seen in realworld network data: edges in the food web denote predatorprey relationships, while edges in the terrorist network reflect social cohesion. The food web could be represented as a hierarchy where different branches reflect different trophic levels (e.g. parasite, predator or prey), while the terrorist network could be interpreted as an organization chart.
For each dataset, we generated 5 sets of training and test subgraphs, each obtained by randomly partitioning the actors into two equal sets. For each training subgraph, we performed a gridsearch over the parameters according to the log marginal likelihood. The remaining parameters were fixed to and . We then computed the log marginal likelihood on the corresponding 5 test subgraphs, and averaged them to obtain hMMSB’s average heldout likelihood. The procedure for MMSB was similar, except that we used 100 random restarts of the MMSB variational EM algorithm [2] on the training subgraphs. MMSB also requires the number of latent roles as a tuning parameter, so we repeated the experiment for each
. For either algorithm, log marginal likelihoods were estimated using 10,000 iterations of importance sampling.
The results are shown in Figure 4. On either dataset, our model’s heldout likelihood is superior to MMSB for all . Notably, MMSB’s likelihood peaks on both datasets at , but selecting so few roles will lead to an extremely coarse network analysis. In contrast, our model automatically recovers a suitable level of hierarchical complexity and enables rich interpretations of the data — as we shall demonstrate next. We note that the “annotated hierarchies” model [19] and infinite stochastic blockmodel [12] are good candidates for evaluation, but the authors did not make code available for computing marginal likelihoods, so we do not include those models in our evaluation.
6 Realdata Qualitative Analysis
6.1 Grassfeeding Wasp Parasitoids Food Web
We now use hMMSB to interpret realworld social networks, beginning with the grass dataset from earlier. We ran our Gibbs sampler on the full network to infer the community hierarchy structure and actor latent community MMs. The model parameters were chosen on the full network via grid search over , according to the log marginal likelihood (estimated using 10,000 importance samples). The remaining parameters were fixed to and . We ran our Gibbs sampler on the optimal parameters for 10,000 iterations of burnin, and took 100 samples with a lag time of 5 iterations. Convergence was determined from a plateauing log complete likelihood plot.
The Gibbs samples represent a posterior distribution over paths . In order to represent the “average” of this posterior, we generated a consensus sample by counting the number of times each pair of actors shared the same community hierarchy position, over all samples. Actors that shared positions in of all samples were assigned to the same path in the consensus. For levels and , we simply took the mode over all samples. In a final postprocessing step to reduce visual clutter, we merged bottomlevel (i.e. level2) communities with actors into one community under the same parent.
The inferred community hierarchy and MM vectors from our Gibbs sampler are reported in Figure 6. We also show the original network, where links have been augmented with their communities and interaction levels (missing links are not shown). The dataset contains trophic level annotations, shown in the hierarchy as counts and in the network as node shapes.
We see that community 3 contains all grass species, 2 contains most herbivores, and 1 contains most parasitoids; in contrast, an assortative clustering algorithm would not discover these multipartite divisions. The “outlier” communities are still more interesting — the herbivore in community 6 is the sole prey of the parasitoids in community 4; hMMSB has separated this subweb from the main foodweb. Moreover, the herbivores in community 2.2 are the sole prey of species 42 and 41 in community 1.1. We also observe that community 5’s two apex parasitoids have the largest and 2ndlargest range of prey species, while community 1.2 contains the apex parasitoid with the thirdlargest range. As to why these communities are separated, we note that the parasitoid in subcommunity 1.2 preys on only two herbivores, compared to at least six herbivores for either parasitoid in community 5.
The community MM vectors in Figure 6 show the frequency at which each species identifies itself with a particular community. Most species identify at the supercommunity level, though some occasionally identify at the subcommunity level. In our model, level2 interactions occur only within supercommunities, hence they account for finegrained, withincommunity interactions. For example, the withincommunity links in community 4, as well as the links from species 65 in subcommunity 1.2 to other members of community 1, are all level2 interactions. Although we have not shown interaction levels for missing links, the latter are sometimes accounted for by level2 interactions (e.g. in community 1).
6.2 Highenergy Physics Citation Network
Finally, we consider a 1,000paper subgraph of the arXiv highenergy physics citation network [1], which we constructed by subsampling papers involved in citations from Jan 2002 through May 2003.
We applied the same parameter selection, convergence criteria, and postprocessing as the previous dataset, and the optimal parameters were . Each of the 25 parameter combinations required less than 6 hours to test on a single processor core. We ran our Gibbs sampler for 10,000 iterations of burnin, and took 10 samples with a lag time of 50 iterations. The entire Gibbs sampling procedure completed in just under 23 hours on a single processor core.
The inferred community hierarchy is shown in Figure 7, where each subcommunity has been annotated with its papers’ most frequent title words^{1}^{1}1While this output is reminiscent of topic models, we stress that hMMSB is not a topic model — the actor paths are learnt solely from the citation network, independent of the paper contents.. We also show the adjacency matrix in Figure 5, permuted to match the order of inferred communities. We observe that the 810paper subcommunity has a sparse citation pattern, implying that its papers are not specific to a particular research topic. This is confirmed by the top 3 keywords: “theory”, “field” and “quantum”, which are general to physics research. The other subcommunities from the same supercommunity are more focused, with top keywords like “supergravity”, “string” and “ppwave” being more specific physical concepts. This is also reflected in the adjacency matrix, which is denser among these subcommunities. The remaining supercommunities form a very dense subnetwork that is mostly separated from the rest of the graph, suggesting that these papers might come from a specific community of researchers, working on a narrow set of research topics. In particular, three of the subcommunities involve the title keyword “tachyon”, which is absent from the large supercommunity. To this point, we have only scratched the surface of the citation network — a deeper analysis would involve the abstracts/contents of the papers in the hMMSBinferred communities.
7 Conclusion
We have developed a treestructured hierarchical MixedMembership Stochastic Blockmodel (hMMSB) that models social networks in terms of the multiple, hierarchical community memberships that actors undertake during interactions. Our model automatically infers the number of subcommunities in each community while simultaneously recovering the communitymixedmemberships of every actor, setting it apart from hierarchydiscovering methods that are restricted to binary hierarchies and/or singlerolememberships for actors. Moreover, hMMSB is expressive enough to account for nondiagonal communitycompatibility matrices, as we have demonstrated through our simulation and grass dataset experiments. On real networks, we show that hMMSB recovers intuitive, mixedmembership community organization. Finally, our collapsed Gibbs sampler efficiently scales to mediumsized datasets of around 1,000 actors, completing inference in a single day on a single processor core.
Acknowledgements
This paper is based on work supported by NSF IIS0713379, AFOSR FA9550010247, ONR N000140910758, DARPA NBCH1080007, NIH 1R01GM093156, and an Alfred P. Sloan Research Fellowship to Eric P. Xing. Qirong Ho is supported by a graduate fellowship from the Agency for Science, Technology And Research, Singapore.
References
 [1] KDD 2003. KDD Cup 2003  Datasets. http://www.cs.cornell.edu/projects/kddcup/datasets.html, June 2010.

[2]
E.M. Airoldi, D.M. Blei, S.E. Fienberg, and E.P. Xing.
Mixed membership stochastic blockmodels.
The Journal of Machine Learning Research
, 9:1981–2014, 2008.  [3] D.M. Blei, T.L. Griffiths, and M.I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM), 57(2):1–30, 2010.
 [4] A. Clauset, C. Moore, and MEJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
 [5] A. Clauset, MEJ Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70(6):66111, 2004.
 [6] H.A. Dawah, B.A. Hawkins, and M.F. Claridge. Structure of the parasitoid communities of grassfeeding chalcid wasps. Journal of animal ecology, 64(6):708–720, 1995.
 [7] M. Girvan and MEJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821, 2002.
 [8] R. Guimera and L.A.N. Amaral. Functional cartography of complex metabolic networks. Nature, 433:895–900, 2005.
 [9] M.S. Handcock, A.E. Raftery, and J.M. Tantrum. Modelbased clustering for social networks. Journal of the Royal Statistical SocietySeries A, 170(2):301–354, 2007.
 [10] P.D. Hoff, A.E. Raftery, and M.S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090–1098, 2002.
 [11] R.R. Hudson. Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7:1–44, 1990.
 [12] C. Kemp and J.B. Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687, 2008.

[13]
C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda.
Learning systems of concepts with an infinite relational model.
In
Proceedings of the National Conference on Artificial Intelligence
, volume 21, page 381. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2006.  [14] A.E. Krause, K.A. Frank, D.M. Mason, R.E. Ulanowicz, and W.W. Taylor. Compartments revealed in foodweb structure. Nature, 426(6964):282–285, 2003.
 [15] V.E. Krebs. Mapping networks of terrorist cells. Connections, 24(3):43–52, 2002.
 [16] W. Li and A. McCallum. Pachinko allocation: DAGstructured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning, page 584. ACM, 2006.
 [17] K.T. Miller, T.L. Griffiths, and M.I. Jordan. Nonparametric Latent Feature Models for Link Prediction. Advances in Neural Information Processing Systems (NIPS), 2009.
 [18] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences, 101(9):2658, 2004.
 [19] D.M. Roy, C. Kemp, V.K. Mansinghka, and J.B. Tenenbaum. Learning annotated hierarchies from relational data. Advances in neural information processing systems, 19:1185, 2007.
 [20] Y.W. Teh and D. Roy. The Mondrian Process. Advances in neural information processing systems, 2009.