Community detection, a form of unsupervised learning, is the task of identifying dense subgraphs within a large graph. For surveys of recent work, see[1, 2, 3]. Community detection is often studied in the context of a generative random graph model, of which the stochastic block model is the most popular. The model specifies how the labels of the vertices are chosen, and how the edges are placed, given the labels. The task of community detection then becomes an inference problem; the vertex labels are the parameters to be inferred, and the graph structure is the data. The advantage of a generative model is that it helps in the design of algorithms for community detection.
The stochastic block model fails to capture two basic properties of networks that are seen in practice. Firstly, it does not model networks that grow over time, such as citation networks or social networks. Secondly, it does not model graphs with heavy-tailed degree distributions, such as the political blog network . The Barabási-Albert model , a.k.a. the preferential attachment model, is a popular random graph model that addresses both the above shortcomings. We use the variation of the model introduced by Jordan  that includes community structure. The paper  considers labels coming from a metric space, though a section of the paper focuses on the case the label space is finite. We consider only a finite label set–the model is described in Section II-A. In recent years there has been substantial study of a variation of preferential attachment model introduced in  such that different vertices can have different fitness. For example, in a citation network, some papers attract more citations than others published at the same time. There has also been work done on recovering clusters from graphs with different fitness (see Chapter 9 of  and references therein). Our work departs from previous work by considering community detection for the model in which the affinity for attachment between an arriving vertex and an existing vertex depends on the labels of both vertices (i.e. for the model of ).
The algorithm we focus on is message passing. Algorithms that are precursors to message passing, in which the membership of a vertex is estimated from its radius one neighborhood in the graph, are also discussed. The algorithm is closest in spirit to that in the papers [9, 10]. Message passing algorithms are local algorithms; vertices in the graph pass messages to each of their neighbors, in an iterative fashion. The messages in every iteration are computed on the basis of messages in the previous iteration. The degree growth rates for vertices in different communities are different (unless there happens to be a tie) so the neighborhood of a vertex conveys some information about its label. A quantitative estimate of this information is the belief (a posteriori probability) of belonging to a particular community. A much better estimate of a vertex’s label could potentially be obtained if the labels of all other vertices were known. Since this information is not known, the idea of message passing algorithms is to have vertices simultaneously update their beliefs.
The main similarity between the preferential attachment model with communities and the stochastic block model is that both produce locally tree-like graphs. However, the probabilities of edges existing are more complicated for preferential attachment models. To proceed to develop the message passing algorithm, we invoke an independence assumption that is suggested by an analysis of the joint degree evolution of multiple vertices. This approach is tantamount to constructing a belief propagation algorithm for a graphical model that captures the asymptotic distribution of neighborhood structure for the preferential attachment graphs.
Organization of the paper
Section II lays the groundwork for the problem formulation and analysis of the community detection problem. It begins by presenting a model for a graph with preferential attachment and community structure, following . The section then presents some key properties of the graphical model in the limit of a large number of vertices. In particular, the empirical distribution of degree, and the evolution of degree of a finite number of vertices, are examined. Stochastic coupling and total variation distance are used extensively. In addition, it is shown that the growth rate parameter for a given fixed vertex can be consistently estimated as the size of the graph converges to infinity. Section III formulates the community recovery problem as a Bayesian hypothesis testing problem, and focuses on two precursors to the message passing algorithm. The first, Algorithm C, estimates the community membership of a vertex based on the children of the vertex (i.e. vertices that attached to the vertex). The second, Algorithm DT, estimates the community membership of a vertex based on the number of children. Section IV investigates an asymptotically equivalent recovery problem, based on a continuous-time random process that approximates the evolution of degree of a vertex in a large graph. A key conclusion of that section is that, for the purpose of estimating the community membership of a single vertex, knowing the neighborhood of the vertex in the graph is significantly more informative than knowing the degree of the vertex. Section V presents our main results about how the performance of the recovery Algorithms C and DT scale in the large graph limit. Section VI presents an extension of Algorithm C whereby the labels of a fixed small set of vertices are jointly estimated based on the likelihood of their joint children sets. This algorithm has exponential complexity in the number of labels estimated, but can be used to seed the message passing algorithm. Since the vertices that arrive early have large degree, it can greatly help to correctly estimate the labels of a small number of such vertices. The message passing algorithm is presented in Section VII. Simulation results are given for a variety of examples in Section VIII. Various proofs, and the derivation of the message passing algorithm, can be found in the appendices.
A different extension of preferential attachment to include communities is given in . In , the community membership of a new vertex is determined based on the membership of the vertices to which the new vertex is attached. The paper focuses on the question of whether large communities coexist as the number of vertices converges to infinity. However, the dynamics of the graph itself is the same as in the original Barabási-Albert model. In contrast, our model assumes that community membership of a vertex is determined randomly before the vertex arrives, and the distribution of attachments made depends on the community membership. It might be interesting to consider a combination of the two models, in which some vertices determine community membership exogenously, and others determine membership based on the memberships of their neighbors.
Another model of graphs with community structure and possibly heavy-tailed degree distribution is the degree corrected stochastic block model – see  for recent work and references.
There is an extensive literature on degree distributions and related properties of preferential attachment graphs, and an even larger literature on the closely related theory of Polya urn schemes. However, the addition of planted community structure breaks the elegant exact analysis methods, such as the matching equivalence formulated in , or methods such as in  or . Still, the convergence of the empirical distribution of the induced labels of half edges (see Proposition 2 below) makes the analysis tractable without the exact formulas. A sequence of models evolved from preferential attachment with fitness , towards the case examined in , such that the attachment probability is weighted by a factor depending on the labels of both the new vertex and a potential target vertex. The model of  is a special case, for which attachment is possible if the labels are sufficiently close. See [6, 16, 8] for additional background literature.
Ii Preliminaries and some asymptotics
Ii-a Barabási - Albert preferential attachment model with community structure
The model consists of a sequence of directed graphs, and vertex labels with distribution determined by the following parameters:222The model is the same as the finite metric space case of  except for differences in notation. in  are here. Also,  denotes the initial graph as while we denote it by , we assume it has edges, and we suppose the random evolution begins with the addition of vertex
: out degree of each added vertex
: number of possible labels; labels are selected from
: a priori label probability distribution
: matrix of strictly positive affinities for vertices of different labels; is the affinity of a new vertex with label for attachment to a vertex of label
: initial time
: initial directed graph with and directed edges
: labels assigned to vertices in
For each , has vertices given by and edges. The graphs can contain parallel edges. No self loops are added during the evolution, so if has no self loops, none of the graphs will have self loops. Of course, by ignoring the orientation of edges, we could obtain undirected graphs.
Given the labeled graph the graph is constructed as follows. First vertex is added and its label is randomly selected from using distribution independently of Then outgoing edges are attached to the new vertex, and the head ends of those edges are selected from among the vertices in
using sampling with replacement, and probability distribution given by preferential attachment, weighted based on labels according to the affinity matrix.
The probabilities are calculated as follows. Note that has edges, and thus half edges, where we view each edge as the union of two half edges. For any edge, its two half edges are each incident to a vertex; the vertices the two half edges are incident to are the two vertices the edge is incident to. Suppose each half edge inherits the label from the vertex it is incident to. If , meaning the new vertex has label and if one of the existing half edges has label then the half edge is assigned weight for the purpose of adding edges outgoing from vertex For each one of the new edges outgoing from vertex an existing half edge is chosen at random from among the possibilities, with probabilities proportional to such weights. The selection is done simultaneously for all of the new edges, or equivalently, sampling with replacement is used. Then the vertices of the respective selected half edges become the head ends of the new edges.
Ii-B Empirical degree distribution for large
For a vertex in , where the distribution of the number of edges incident on the vertex from vertex depends on the label of the vertex, the degree of the vertex, and the labels on all the half edges incident to the existing vertices in The empirical distribution of labels of half edges in converges almost surely as as explained next. Let for , where denotes the number of half edges with label in It is easy to see that is a discrete-time Markov process, with initial state determined by the labels of vertices in Let Thus, is the fraction of half edges that have label at time Let where
A second limit result we restate from  concerns the empirical degree distribution for the vertices with a given label. For and integers and let:
denote the number of vertices with label in
denote the number of vertices with label and with degree in
denote the fraction of vertices with label that have degree in
 (Limiting empirical distribution of degree for a given label) Let and be fixed. Then almost surely, where
The asymptotic equivalence in (3) as follows from Sterling’s formula for the Gamma function. The proposition shows that the limiting degree distribution of a vertex with label selected uniformly at random from among the vertices with label in has probability mass function with tail decreasing like If is the same for all then for all and we see the classical tail exponent -3 for the Barabási-Albert model.
The proof of Proposition 2 given in  is based on examining the evolution of the fraction of vertices with a given label and given degree Using the convergence analysis of stochastic approximation theory, this yields limiting difference equations for that can be solved to find However, since all vertices with a given label are grouped together, the analysis does not identify the limiting degree distribution of a vertex as a function of the arrival time of the vertex.
The following section investigates the evolution of the degree of a single vertex, or finite set of vertices, conditioned on their labels. As a preliminary application, we produce an alternative proof of Proposition 2 in Appendix D. The main motivation for this alternative approach is that it can also be applied to analyze the probability of label error as a function of time of arrival of a vertex, for two of the recovery algorithms we consider.
Ii-C Evolution of vertex degree–the processes and
Consider the preferential attachment model defined in Section II-A. Given a vertex with , consider the process where is the degree of vertex at time So The conditional distribution (i.e. probability law) of given is given by:
It follows that, given the conditional distribution of
is a mixture of binomial distributions with selection probability distribution, which we write as:
Proposition 1 implies, given any , if is sufficiently large, Therefore, for
A mixture of binomial distributions, all with small means, can be well approximated by a Bernoulli distribution with the same mean. Thus, we expect
Based on these observations, we define a random process that is an idealized variation of obtained by replacing by the constant vector and allowing jumps of size one only. The process has parameters and , where is the activation time, is the state at the activation time, and is a rate parameter. The process is a time-inhomogeneous Markov process with initial value For and such that , we require:
By induction, starting at time , we find that for If and , then for all with probability one, in which case (4) and the initial condition completely specify the distribution of However, for added generality we allow in which case the above construction can break down. To address such situation, we define such that is the stopping time and we define for
The process can be thought of as a (non Markovian) discrete time birth process with activation time and birth probability at a time proportional to the number of individuals. However, the birth probability (or birth rate) per individual, , has a factor which tends to decrease the birth rate per individual. To obtain a process with constant birth rate per individual we introduce a time change by using the process In other words, we use for the original time variable and as a new time variable. We will define a process such that , or equivalently, in a sense to be made precise.
The process is a continuous time pure birth Markov process with initial state and birth rate in state for some (It is a simple example of a Bellman-Harris process, and is related to busy periods in Markov queueing systems.) The process represents the total number of individuals in a continuous time branching process beginning with individuals activated at time 0, such that each individual spawns another at rate For fixed , has the negative binomial distribution In other words, its marginal probability distribution is given by
In particular, taking shows
is the geometric distribution with parameter, and hence, mean The expression (5) can be easily derived for by solving the Kolmogorov forward equations recursively in : for with the convention and base case, For the process has the same distribution as the sum of independent copies of with proving the validity of (5) by the same property for the negative binomial distribution.
Let for integers The mapping from to does not depend on the parameter so a hypothesis testing problem for maps to a hypothesis testing problem for There is loss of information because the mapping is not invertible, but the loss tends to zero as because the rate of sampling of increases without bound.
The following proposition, proven in Appendix B, shows that and are asymptotically equivalent in the sense of total variation distance. Since the processes and are integer valued, discrete time processes, their trajectories over a finite time interval have discrete probability distributions. See the beginning of Appendix B for a review of the definition of total variation distance and its significance for coupling. Sometimes we write instead of , and instead of to denote the dependence on the parameter
Suppose such that and is bounded. Fix Then
and for any ,
The first part of Proposition 3 can be strengthened as follows. The labels in are mutually independent, each with distribution We can define a joint probability distribution over
by specifying the conditional probability distribution ofgiven as follows. Given , is a Markov sequence with and:
By the law of total probability, this gives the same marginal distribution foras (4) with as long as
Suppose such that and is bounded. Fix Then
The proof is a minor variation of the proof of Proposition 3 because the estimates on total variation distance are uniform for or bounded. Details are left to the reader.
Ii-D Joint evolution of vertex degrees
Instead of considering the evolution of degree of a single vertex we consider the evolution of degree for a finite set of vertices, still for the preferential attachment model with communities, defined in Section II-A. Given integers with , let if and let denote the degree of vertex at time if . Let Let We consider the evolution of given Let for About the notation vs. : The vector denotes the limiting rate parameters for the possible vertex labels defined in (2), whereas denotes the limiting rate parameters for the specific set of vertices being focused on, conditioned on their labels being
The process is defined similarly. Fix , integers with , and Suppose for each that is a version of the process defined in Section II-C, with parameters , , and with the extension for Furthermore, suppose the processes are mutually independent. Finally, let where Note that is itself a time-inhomogeneous Markov process. In what follows we write instead of when we wish to emphasize the dependence on the parameter vector Let be defined analogously, based on
Fix the parameters of the preferential attachment model, Fix and and let for Let and let and vary such that , and is bounded. Then
The proposition is proved in Appendix C. A key implication of the proposition is that the degree evolution processes for a finite number of vertices are asymptotically independent in the assumed asymptotic regime. In particular, the following corollary is an immediate consequence of the proposition. It shows that the degrees of vertices at a fixed time are asymptotically independent with marginal distributions given by (5).
Ii-E Large time evolution of degree of a fixed vertex and consistent estimation of the rate parameter of a vertex
Consider the Barabási-Albert model with communities. Fix and let denote the degree of in for To avoid triviality, assume is not an isolated vertex in the initial graph The following proposition offers a way to consistently estimate the rate parameter If the parameters
of the Barabási-Albert model are distinct, it follows that any fixed finite set of vertices could be consistently classified in the limit as, without knowledge of the model parameters.
(Large time behavior of degree evolution) For fixed,
Here, “a.s." means almost surely, or in other words, with probability one.
The following strengthening of Proposition 6 is conjectured.
(Sharp large time behavior of degree evolution) For fixed,
for a random variable
for a random variablewith
Iii Community recovery based on children
Given vertices and , we say is a child of , and is a parent of , if and there is an edge from to It is assumed that the known initial graph is arbitrary and carries no information about vertex labels. Thus, for the purpose of inferring the vertex labels, the edges in are not relevant beyond the degrees that they imply for the vertices in Assuming is an integer with , let denote the children of in and the parents of So if and
Consider the problem of estimating given observation of a random object For instance, the object could be the degree of vertex in , or it could be the set of children of in or it could be the entire graph. This is an -ary hypothesis testing problem. It is assumed a priori that the label has probability distribution so it makes sense to try to minimize the probability of error in the Bayesian framework. Let denote the log-likelihood vector defined by for By a central result of Bayesian decision theory, the optimal decision rule is the MAP estimator, given by
(i) Knowing is equivalent to knowing the indices of the vertices and the undirected graph induced by dropping the orientations of the edges of
(ii) The estimators considered in this paper are assumed to know the order of arrival of the vertices (which we take to be specified by the indices of the vertices for brevity) and the parameters , and It is clear that in some cases the parameters can be estimated from a realization of the graph for sufficiently large In particular, the parameter is directly observable. By Proposition 6, if the order of arrival is known, the set of growth rates can be estimated. So if the ’s are distinct, the distribution can also be consistently estimated.
(iii) If the indices of the vertices are not known and only the undirected version of the graph is given, it may be possible to estimate the indices if is sufficiently large. Such problem has been explored recently for the classical Barabási-Albert model , but we don’t pursue it here for the variation with a planted community.
The first recovery algorithm we describe, Algorithm C (“C" for “children"), is to let denote the set of children, , of vertex in Equivalently, could be observation of with parameters and where However, motivated by Proposition 3, we consider instead observation of which has a distribution asymptotically equivalent to the distribution of Let denote the initial degree of vertex , defined to be the degree of in if and otherwise. Given a possible children set let denote the corresponding degree evolution sample path: for The probability corresponds to children set is given by
so the log likelihood for observation is:
Algorithm C for estimating is to use the MAP estimator based on and Using the approximation and approximating the sum by an integral we find where
The second recovery algorithm we describe, Algorithm DT (“DT" for “degree thresholding"), is to let denote the number of children of vertex in , or, equivalently, the degree of at time minus the initial degree of Equivalently, could be observation of However, motivated by Proposition 3, we consider instead consider observation of which has the distribution given , for The log likelihood vector in this case, given the number of children, , is:
where we have dropped a term (log of binomial coefficient) not depending on Algorithm DT for estimating is to use the MAP decision rule based on and or in other words, the MAP decision rule based on or equivalently, based on where (because ). Let denote the resulting average error probability
Iv Hypothesis testing for
Proposition 3 gives an asymptotic equivalence of and Recall that is obtained by sampling the continuous time process at integers Thus, the continuous time process is not observable. However, as the rate that is sampled increases without bound, so asymptotically is observed. We consider here the hypothesis testing problem based on observation of such that under it has rate parameter for This is sensible in case the parameter values , are distinct. To this end, we derive the log likelihood vector.
Suppose such that and . Since the inter-jump periods are independent (exponential) random variables, the likelihood of being the jump times during under hypothesis , is the product of the likelihoods of the observed inter-jump periods, with an additional factor of the likelihood of not seeing a jump in the last interval:
Thus, the log likelihood for observing this is (letting ):
(With , (13) is the same as (12), although in (12) the variables are supposed to be integer valued.) Let Note that is the area under the trajectory of . Moreover, is the value of . So the log-likelihood vector is given by:
which is a linear combination of and Thus, the MAP decision rule has a simple form. Let denote the average error probability for the decision rule based on observation of
There is apparently no closed form expression for the distribution of so computation of
apparently requires Monte Carlo simulation or some other numerical method. A closed form expression for the moment generating function ofis given in the following proposition, proved in Appendix F, and it can be used to either bound the probability of error or to accelerate its estimation by importance sampling.
The joint moment generating function of
The joint moment generating function ofand is given as follows, where denotes expectation assuming the parameters of are
Proposition 7 can be used to bound for the special case of two possible labels, in which estimating is a binary hypothesis testing problem: , vs. For such a problem the likelihood vector can be replaced by the log likelihood ratio, By a standard result in the theory of binary hypothesis testing (due to , stated without proof in , proved in special case in , and same proof easily extends to general case), the probability of error for the MAP decision rule is bounded by
where the Bhattacharyya coefficient (or Hellinger integral) is defined by and and
are the prior probabilities on the hypotheses. The proposition with, and yields
Here we wrote to denote it as the Bhattacharyya coefficient for Algorithm C (for the large limit). Using this expression in (16) provides upper and lower bounds on in case
For the sake of comparison, we note that the Bhattacharyya coefficient for the hypothesis testing problem based on alone, i.e. Algorithm DT, is easily found to be:
V Performance scaling for Algorithms C and DT
Consider the community recovery problem for and fixed, and large such that the rate parameters are distinct. Let be an arbitrarily small positive constant. The problem of recovering for some vertex with from using children (C) (respectively, degree thresholding (DT)) is asymptotically equivalent to the -ary hypothesis testing problem for observation (respectively, ) with the same parameters and This leads to the following proposition, based on the results on coupling of , and and the connection of to
(Performance scaling for Algorithms C and DT) (a) Let