1 Introduction
Many modern data sets contain not only rich information about each object, but also pairwise relationships between them, forming networks where each object is a node and links represent the relationships. In document networks, for example, each node is a document containing a sequence of words, and the links between nodes are citations or hyperlinks. Both the content of the documents and the topology of the links between them are meaningful.
Over the past few years, two disparate communities have been approaching these data sets from different points of view. In the data mining community, the goal has been to augment traditional approaches to learning and data mining by including relations between objects [15, 33]: for instance, to use the links between documents to help us label them by topic. In the network community, including its subset in statistical physics, the goal has been to augment traditional community structure algorithms such as the stochastic block model [14, 20, 30] by taking node attributes into account: for instance, to use the content of documents, rather than just the topological links between them, to help us understand their community structure.
In the original stochastic block model, each node has a discrete label, assigning it to one of K communities. These labels, and the K × K matrix of probabilities with which a given pair of nodes with a given pair of labels have a link between them, can be inferred using Monte Carlo algorithms (e.g. [26]) or, more efficiently, with belief propagation [12, 11] or pseudolikelihood approaches [7]. However, in real networks communities often overlap, and a given node can belong to multiple communities. This led to the mixed-membership block model [1], where the goal is to infer, for each node u, a distribution or mixture of labels θ_u describing to what extent it belongs to each community. If we assume that links are assortative, i.e., that nodes are more likely to link to others in the same community, then the probability of a link between two nodes u and v depends on some measure of similarity (say, the inner product) of θ_u and θ_v.

These mixed-membership block models fit nicely with classic ideas in topic modeling. In models such as Probabilistic Latent Semantic Analysis (plsa) [19] and Latent Dirichlet Allocation (lda) [4], each document u has a mixture θ_u of topics. Each topic corresponds in turn to a probability distribution over words, and each word in u is generated independently from the resulting mixture of distributions. If we think of θ_u as both the mixture of topics for generating words and the mixture of communities for generating links, then we can infer θ_u jointly from the documents' content and the presence or absence of links between them.

There are many possible such models, and we are far from the first to think along these lines. Our innovation is to take as our starting point a particular mixed-membership block model recently developed in the statistical physics community [2], which we refer to as the bkn model. It differs from the mixed-membership stochastic block model (mmsb) of [1] in several ways:

- The bkn model treats the community membership mixtures θ_u directly as parameters to be inferred. In contrast, mmsb treats θ_u as hidden variables generated by a Dirichlet distribution, and infers the hyperparameters of that distribution. The situation between plsa and lda is similar: plsa infers the topic mixtures θ_u, while lda generates them from a Dirichlet distribution.
- The mmsb model generates each link according to a Bernoulli distribution, with an extra parameter for sparsity. Instead, bkn treats the links as a random multigraph, where the number of links between each pair of nodes is Poisson-distributed. As a result, the derivatives of the log-likelihood with respect to θ and the other parameters are particularly simple.
These two factors make it possible to fit the bkn model using an efficient and exact expectation-maximization (EM) algorithm, making its inference highly scalable. The bkn model has another advantage as well:

- The bkn model is degree-corrected, in that it takes the observed degrees of the nodes into account when computing the expected number of edges between them. Thus it recognizes that two documents that have very different degrees might in fact have the same mix of topics; one may simply be more popular than the other.
In our work, we use a slight variant of the bkn model to generate the links, and we use plsa to generate the text. We present an EM algorithm for inferring the topic mixtures θ_u and other parameters. (While we do not impose a Dirichlet prior on the topic mixtures, it is easy to add a corresponding term to the update equations.) Our algorithm is scalable in the sense that each iteration takes O(K(N + M + R)) time for networks with K topics, N documents, and M links, where R is the sum over documents of the number of distinct words appearing in each one. In practice, our EM algorithm converges within a small number of iterations, making the total running time linear in the size of the corpus.
Our model can be used for a variety of learning and generalization tasks, including document classification or link prediction. For document classification, we can obtain hard labels for each document by taking its most-likely topic with respect to θ_u, and optionally improve these labels further with local search. For link prediction, we train the model using a subset of the links, and then ask it to rank the remaining pairs of documents according to the probability of a link between them. For each task we determine the optimal relative weight α of the content vs. the link information.
We performed experiments on three real-world data sets, with thousands of documents and millions of words. Our results show that our algorithm is more accurate, and considerably faster, than previous techniques for both document classification and link prediction.
The rest of the paper is organized as follows. Section 2 describes our generative model, and compares it with related models in the literature. Section 3 gives our EM algorithm and analyzes its running time. Section 4 contains our experimental results for document classification and link prediction, comparing our accuracy and running time with other techniques. In Section 5, we conclude, and offer some directions for further work.
2 Our Model and Previous Work
In this section, we give our proposed model, which we call the Poisson mixed-topic link model (pmtlm), and its degree-corrected variant pmtlm-dc.
2.1 The Generative Model
Consider a network of N documents. Each document u has a fixed length L_u, and consists of a string of words w_{us} for 1 ≤ s ≤ L_u, where each w_{us} ∈ {1, …, W} and W is the number of distinct words in the vocabulary. In addition, each pair of documents u, v has an integer number of links A_{uv} connecting them, giving an adjacency matrix A. There are K topics, which play the dual role of the overlapping communities in the network.
Our model generates both the content and the links as follows. We generate the content using the plsa model [19]. Each topic z is associated with a probability distribution β_z over words, and each document u has a probability distribution θ_u over topics. For each document u and each 1 ≤ s ≤ L_u, we independently

- choose a topic z_{us} = z with probability θ_{uz}, and

- choose the word w_{us} = w with probability β_{zw}.
Thus the total probability that w_{us} is a given word w is

(1)  Pr[w_{us} = w] = Σ_z θ_{uz} β_{zw} .
We assume that the number of topics K is fixed. The distributions θ_u and β_z are parameters to be inferred.
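As a sanity check, the word-generation process above can be sketched in a few lines of NumPy. The dimensions and parameter values below are toy assumptions of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (our assumptions): K topics, W distinct words, length L_u.
K, W, L_u = 3, 10, 20

# Model parameters: topic mixture theta_u (sums to 1) and per-topic word
# distributions beta (each row of the (K, W) matrix sums to 1).
theta_u = rng.dirichlet(np.ones(K))
beta = rng.dirichlet(np.ones(W), size=K)

# Generate the document: for each position, draw a topic, then a word.
topics = rng.choice(K, size=L_u, p=theta_u)
words = np.array([rng.choice(W, p=beta[z]) for z in topics])

# Marginal probability that a given position holds word w, as in Eq. (1):
# Pr[w_{us} = w] = sum_z theta_{uz} * beta_{zw}
word_marginal = theta_u @ beta  # shape (W,), sums to 1
```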
We generate the links using a version of the Ball-Karrer-Newman (bkn) model [2]. Each topic z is associated with a link density η_z. For each pair of documents u, v and each topic z, we independently generate a number of links which is Poisson-distributed with mean θ_{uz} θ_{vz} η_z. Since the sum of independent Poisson variables is Poisson, the total number of links A_{uv} between u and v is distributed as

(2)  A_{uv} ~ Poi( Σ_z θ_{uz} θ_{vz} η_z ) .
Since A_{uv} can exceed 1, this gives a random multigraph. In the data sets we study below, A_{uv} is 0 or 1 depending on whether u cites v, giving a simple graph. On the other hand, in the sparse case the event that A_{uv} > 1 has low probability in our model. Moreover, the fact that A_{uv} is Poisson-distributed rather than Bernoulli makes the derivatives of the likelihood with respect to θ and the other parameters very simple, allowing us to write down an efficient EM algorithm for inferring them.
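A minimal sketch of this Poisson multigraph, with toy values of our own for N, K, θ, and η:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (our assumptions): N documents, K topics.
N, K = 5, 3
theta = rng.dirichlet(np.ones(K), size=N)  # theta[u] = topic mixture of doc u
eta = np.array([2.0, 1.0, 0.5])            # per-topic link densities eta_z

# Expected link counts: mean[u, v] = sum_z eta_z * theta_{uz} * theta_{vz}
mean = (theta * eta) @ theta.T

# Draw the multigraph: A_{uv} ~ Poisson(mean[u, v]) independently for u < v.
A = np.zeros((N, N), dtype=int)
iu, ju = np.triu_indices(N, k=1)
A[iu, ju] = rng.poisson(mean[iu, ju])
A += A.T  # symmetric adjacency matrix of the (multi)graph
```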
This version of the model assumes that links are assortative, i.e., that links between documents only form to the extent that they belong to the same topic. One can easily generalize the model to include disassortative links as well, replacing the vector η with a K × K matrix η_{rs} that allows documents with distinct topics to link [2].
We also consider degree-corrected versions of this model, where in addition to its topic mixture θ_u, each document u has a propensity S_u of forming links. In that case,

(3)  A_{uv} ~ Poi( S_u S_v Σ_z θ_{uz} θ_{vz} η_z ) .
We call this variant the Poisson Mixed-Topic Link Model with Degree Correction (pmtlm-dc).
2.2 Prior Work on Content–Link Models
Most models for document networks generate content using either plsa [19], as we do, or lda [4]. The distinction is that plsa treats the document mixtures θ_u as parameters, while in lda they are hidden variables, integrated over a Dirichlet distribution. As we show in Section 3, our approach gives a simple, exact EM algorithm, avoiding the need for sampling or variational methods. While we do not impose a Dirichlet prior on θ_u in this paper, it is easy to add a corresponding term to the update equations for the EM algorithm, with no loss of efficiency.
There are a variety of methods in the literature to generate links between documents. phits-plsa [10], link-lda [13], and link-plsa-lda [27] use the phits [9] model for link generation. phits treats each document as an additional term in the vocabulary, so two documents are similar if they link to the same documents. This is analogous to a mixture model for networks studied in [28]. In contrast, block models like ours treat documents as similar if they link to similar documents, as opposed to literally the same ones.
The pairwise link-lda model [27], like ours, generates the links with a mixed-topic block model, although as in mmsb [1] and lda [4] it treats the θ_u as hidden variables integrated over a Dirichlet prior. They fit their model with a variational method that requires a parameter for every pair of documents, making it less scalable than our approach.
In the cpl-dc model [32], the link probability from u to v is determined by their topic mixtures and the popularity of v, which is drawn from a Gamma distribution. This popularity plays a role similar to the degree-correcting parameter S_v in our model, although we correct for the degree of u as well. However, cpl-dc does not generate the content, but takes it as given.

The Relational Topic Model (rtm) [5, 6] assumes that the link probability between u and v depends on the topics of the words appearing in their text. In contrast, our model uses the underlying topic mixtures to generate both the content and the links. Like our model, rtm defines the similarity of two documents as a weighted inner product of their topic mixtures; however, in rtm the probability of a link is a nonlinear function (logistic, exponential, or normal) of this similarity.
Although it deals with a slightly different kind of data set, our model is closest in spirit to the Latent Topic Hypertext Model (lthm) [18]. This is a generative model for hypertext networks, where each link from u to v is associated with a specific word in u. If we sum over all words in u, the total number of links from u to v that lthm would generate follows a binomial distribution

(4)  A_{uv} ~ Bin( L_u, λ_v Σ_z θ_{uz} θ_{vz} ) ,

where λ_v is, in our terms, a degree-correction parameter. When L_u is large this becomes a Poisson distribution with mean L_u λ_v Σ_z θ_{uz} θ_{vz}. Our model differs from this in two ways: our parameters η_z give a link density associated with each topic z, and our degree correction does not assume that the number of links from u is proportional to its length.
We briefly mention several other approaches. The authors of [16] extend the probabilistic relational model (prm) framework, proposing a unified generative model for both content and links in a relational structure. In [24], the authors propose a link-based model that describes both node attributes and links. The htm model [31] treats links as fixed rather than generating them, and only generates the text. Finally, the lmmg model [22] treats the appearance or absence of a word as a binary attribute of each document, and uses a logistic or exponential function of these attributes to determine the link probabilities.
3 A Scalable EM Algorithm
Here we describe an efficient Expectation-Maximization algorithm to find the maximum-likelihood estimates of the parameters of our model. Each update takes O(K(N + M + R)) time for a document network with K topics, N documents, and M links, where R is the sum over the documents of the number of distinct words in each one. Thus the running time per iteration is linear in the size of the corpus.

For simplicity we describe the algorithm for the simpler version of our model, pmtlm. The algorithm for the degree-corrected version, pmtlm-dc, is similar.
3.1 The Likelihood
Let C_{uw} denote the number of times word w appears in document u. Up to constants that do not depend on the parameters, the log-likelihood of the content under the plsa model is

(5)  L^content(θ, β) = Σ_u Σ_w C_{uw} log Σ_z θ_{uz} β_{zw} ,

and the log-likelihood of the links under the Poisson model (2) is

(6)  L^link(θ, η) = Σ_{u<v} ( A_{uv} log Σ_z η_z θ_{uz} θ_{vz} − Σ_z η_z θ_{uz} θ_{vz} ) .
3.2 Balancing Content and Links
While we can use the total likelihood L^content + L^link directly, in practice we can improve our performance significantly by better balancing the information in the content vs. that in the links. In particular, the log-likelihood of each document's text is proportional to its length L_u, while its contribution to L^link is proportional to its degree d_u. Since a typical document has many more words than links, L^content tends to be much larger than L^link.

Following [19], we can provide this balance in two ways. One is to normalize L^content by the document lengths L_u, and another is to add a parameter α that reweights the relative contributions of the two terms L^content and L^link. We then maximize the function

(7)  L_α(θ, β, η) = α L^content + (1−α) L^link .
Varying α from 0 to 1 lets us interpolate between two extremes: studying the document network purely in terms of its topology, or purely in terms of the documents' content. Indeed, we will see in Section 4 that the optimal value of α depends on which task we are performing: closer to 0 for link prediction, and closer to 1 for topic classification.

3.3 Update Equations and Running Time
We maximize L_α as a function of θ, β, η using an EM algorithm, very similar to the one introduced by [2] for overlapping community detection. We start with a standard trick to change the log of a sum into a sum of logs, writing

(8)  log Σ_z θ_{uz} β_{zw} ≥ Σ_z h_{uw}(z) log [ θ_{uz} β_{zw} / h_{uw}(z) ] ,
     log Σ_z η_z θ_{uz} θ_{vz} ≥ Σ_z q_{uv}(z) log [ η_z θ_{uz} θ_{vz} / q_{uv}(z) ] .
Here h_{uw}(z) is the probability that a given appearance of word w in document u is due to topic z, and q_{uv}(z) is the probability that a given link between u and v is due to topic z. This lower bound holds with equality when

(9)  h_{uw}(z) = θ_{uz} β_{zw} / Σ_{z′} θ_{uz′} β_{z′w} ,
     q_{uv}(z) = η_z θ_{uz} θ_{vz} / Σ_{z′} η_{z′} θ_{uz′} θ_{vz′} ,
giving us the E step of the algorithm.
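The E step is a pair of normalizations over topics. A dense-matrix sketch with our own variable names (a real implementation would store h and q only for observed word occurrences and links, as discussed below):

```python
import numpy as np

# One exact E step, for dense toy inputs.
# Assumed shapes: theta (N, K), beta (K, W), eta (K,).
def e_step(theta, beta, eta):
    # h[u, w, z]: probability that an occurrence of word w in document u
    # came from topic z, proportional to theta_{uz} * beta_{zw}.
    h = theta[:, None, :] * beta.T[None, :, :]  # (N, W, K)
    h /= h.sum(axis=2, keepdims=True)
    # q[u, v, z]: probability that a link between u and v came from topic z,
    # proportional to eta_z * theta_{uz} * theta_{vz}.
    q = eta[None, None, :] * theta[:, None, :] * theta[None, :, :]  # (N, N, K)
    q /= q.sum(axis=2, keepdims=True)
    return h, q
```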
For the M step, we derive update equations for the parameters θ, β, η. By taking derivatives of the log-likelihood (7) (see Appendix A for details) we obtain

(10)  β_{zw} ∝ Σ_u C_{uw} h_{uw}(z) ,  normalized so that Σ_w β_{zw} = 1 ,

(11)  η_z = Σ_{u,v} A_{uv} q_{uv}(z) / ( Σ_u θ_{uz} )² ,

(12)  θ_{uz} = [ α Σ_w C_{uw} h_{uw}(z) + (1−α) Σ_v A_{uv} q_{uv}(z) ] / [ α L_u + (1−α) d_u ] .

Here d_u = Σ_v A_{uv} is the degree of document u.
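The M step can be sketched the same way. This dense version is reconstructed from the model description and ignores the sparsity discussed next; the update order (computing η from the updated θ) is an implementation choice of ours:

```python
import numpy as np

# One M step for the uncorrected PMTLM, written against the E-step
# posteriors h (N, W, K) and q (N, N, K). C is the (N, W) word-count matrix,
# A the adjacency matrix, alpha the content/link weight. A sketch, not the
# authors' reference implementation.
def m_step(C, A, h, q, alpha):
    # beta_{zw} proportional to sum_u C_{uw} h_{uw}(z), normalized over words.
    beta = np.einsum('uw,uwz->zw', C, h)
    beta /= beta.sum(axis=1, keepdims=True)
    # theta_{uz}: weighted combination of word and link responsibilities;
    # dividing by the row sum is the same as dividing by
    # alpha * L_u + (1 - alpha) * d_u.
    num = (alpha * np.einsum('uw,uwz->uz', C, h)
           + (1 - alpha) * np.einsum('uv,uvz->uz', A, q))
    theta = num / num.sum(axis=1, keepdims=True)
    # eta_z = sum_{u,v} A_{uv} q_{uv}(z) / (sum_u theta_{uz})^2
    eta = np.einsum('uv,uvz->z', A, q) / theta.sum(axis=0) ** 2
    return theta, beta, eta
```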
To analyze the running time, let R_u denote the number of distinct words in document u, and let R = Σ_u R_u. Then only KR of the parameters h_{uw}(z) are nonzero. Similarly, q_{uv}(z) only appears if A_{uv} > 0, so in a network with M links only KM of the q_{uv}(z) are nonzero. The total number of nonzero terms appearing in (9)–(12), and hence the running time of the E and M steps, is thus O(K(N + M + R)).
As in [2], we can speed up the algorithm if θ is sparse, i.e., if many documents belong to fewer than K topics, so that many of the θ_{uz} are zero. According to (9), if θ_{uz} = 0 then h_{uw}(z) = q_{uv}(z) = 0, in which case (12) implies that θ_{uz} = 0 for all future iterations. If we choose a threshold below which θ_{uz} is effectively zero, then as θ becomes sparser we can maintain just those h_{uw}(z) and q_{uv}(z) where θ_{uz} ≠ 0. This in turn simplifies the updates for β and η in (10) and (11).
We note that the simplicity of our update equations comes from the fact that A_{uv} is Poisson, and that its mean is a multilinear function of the parameters. Models where A_{uv} is Bernoulli-distributed with a more complicated link probability, such as a logistic function, have more complicated derivatives of the likelihood, and therefore more complicated update equations.
Note also that this EM algorithm is exact, in the sense that the maximum-likelihood estimators are fixed points of the update equations. This is because the E step (9) is exact, since the conditional distribution of topics associated with each word occurrence and each link is a product distribution, which we can describe exactly with h_{uw}(z) and q_{uv}(z). (There are typically multiple fixed points, so in practice we run our algorithm with many different initial conditions, and take the fixed point with the highest likelihood.)
This exactness is due to the fact that the topic mixtures θ_u are parameters to be inferred. In models such as lda and mmsb, where θ_u is a hidden variable integrated over a Dirichlet prior, the topics associated with each word and link have a complicated joint distribution that can only be approximated using sampling or variational methods. (To be fair, recent advances such as stochastic optimization based on network subsampling [17] have shown that approximate inference in these models can be carried out quite efficiently.)

On the other hand, in the context of finding communities in networks, models with Dirichlet priors have been observed to generalize more successfully than Poisson models such as bkn [17]. Happily, we can impose a Dirichlet prior on θ_u with no loss of efficiency, simply by including pseudocounts in the update equations: in essence, adding additional words and links that are known to come from each topic (see Appendix C). This lets us obtain a maximum a posteriori (MAP) estimate of an lda-like model. We leave this as a direction for future work.
3.4 Discrete Labels and Local Search
Our model, like plsa and the bkn model, lets us infer a soft classification: a mixture of topic labels or community memberships for each document. However, we often want to infer categorical labels, where each document u is assigned to a single topic t_u. A natural way to do this is to let t_u be the most-likely label in the inferred mixture, t_u = argmax_z θ_{uz}. This is equivalent to rounding θ_u to a delta function, with θ_{uz} = 1 for z = t_u and θ_{uz} = 0 for z ≠ t_u.
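This rounding step is a one-liner per document; a sketch (function name is ours):

```python
import numpy as np

# Round the inferred mixture theta (N, K) to hard labels: take the
# most-likely topic per document, i.e. a delta-function mixture.
def harden(theta):
    labels = theta.argmax(axis=1)
    delta = np.zeros_like(theta)
    delta[np.arange(theta.shape[0]), labels] = 1.0
    return labels, delta
```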
If we wish, we can improve these discrete labels further using local search. If each document has just a single topic, the log-likelihood of our model is

(13)  L^content(t, β) = Σ_u Σ_w C_{uw} log β_{t_u w} ,

(14)  L^link(t, η) = Σ_{u<v} ( A_{uv} log η_{t_u t_v} − η_{t_u t_v} ) .
Note that here η is a K × K matrix, with off-diagonal entries that allow documents with different topics to be linked. Otherwise, these discrete labels would cause the network to split into separate components.
Let n_z denote the number of documents of topic z, let L_z be their total length, and let c_{zw} be the total number of times word w appears in them. Let m_{rs} denote the total number of links between documents of topics r and s, counting each link twice if r = s. Then the MLEs for β and η are
(15)  β_{zw} = c_{zw} / L_z ,   η_{rs} = m_{rs} / (n_r n_s) .
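A sketch of these count statistics and point estimates for a labeling t (function and variable names are ours):

```python
import numpy as np

# Count statistics and MLEs for a discrete labeling t. C is the (N, W)
# word-count matrix, A the symmetric adjacency matrix, K the number of
# topics. A sketch of the point estimates described above.
def discrete_mles(C, A, t, K):
    onehot = np.eye(K)[t]            # (N, K) indicator of each document's label
    n = onehot.sum(axis=0)           # n_z: number of documents with label z
    L_z = onehot.T @ C.sum(axis=1)   # total length of topic-z documents
    c_zw = onehot.T @ C              # occurrences of each word in topic z
    # Links between topics; the diagonal counts each within-topic link twice,
    # since both A_{uv} and A_{vu} contribute.
    m = onehot.T @ A @ onehot
    beta = c_zw / L_z[:, None]       # beta_{zw} = c_{zw} / L_z
    eta = m / np.outer(n, n)         # eta_{rs} = m_{rs} / (n_r n_s)
    return beta, eta
```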
Applying these MLEs in (13) and (14) gives us a point estimate of the likelihood of a discrete topic assignment t, which we can normalize or reweight as discussed in Section 3.2 if we like. We can then maximize this likelihood using local search: for instance, using the Kernighan-Lin heuristic as in [21] or a Monte Carlo algorithm to find a local maximum of the likelihood in the vicinity of t. Each step of these algorithms changes the label of a single document u, so we can update the values of n_z, L_z, c_{zw}, and m_{rs} and compute the new likelihood in time proportional to the number of distinct words and links of u. In our experiments we used the KL heuristic, and found that for some data sets it noticeably improved the accuracy of our algorithm for the document classification task.

4 Experimental Results
In this section we present empirical results on our model and our algorithm for unsupervised document classification and link prediction. We compare its accuracy and running time with those of several other methods, testing it on three real-world document citation networks.
4.1 Data Sets
The top portion of Table 1 lists the basic statistics for three real-world corpora [29]: Cora, Citeseer, and PubMed (available for download at http://www.cs.umd.edu/projects/linqs/projects/lbc/). Cora and Citeseer contain papers in machine learning, with K = 7 topics for Cora and K = 6 for Citeseer. PubMed consists of medical research papers on K = 3 topics, namely three types of diabetes. All three corpora have ground-truth topic labels provided by human curators.

The data sets for these corpora are slightly different. The PubMed data set has the number of times each word appeared in each document, while the data for Cora and Citeseer records whether or not a word occurred at least once in the document. For Cora and Citeseer, we treat C_{uw} as being 0 or 1.
4.2 Models and Implementations
We compare the Poisson Mixed-Topic Link Model (pmtlm) and its degree-corrected variant, denoted pmtlm-dc, with phits-plsa, link-lda, cpl-dc, and rtm (see Section 2.2). We used our own implementation of both phits-plsa and rtm. For rtm, we implemented the variational EM algorithm given in [6]. The implementation is based on the lda-c code available from the authors (http://www.cs.princeton.edu/~blei/lda-c/). We also tried the code provided by J. Chang (http://www.cs.princeton.edu/~blei/lda/), which uses a Monte Carlo algorithm for the E step, but we found the variational algorithm works better on our data sets. While rtm includes a variety of link probability functions, we only used the sigmoid function. We also assume a symmetric Dirichlet prior. The results for link-lda and cpl-dc are taken from [32].

Each E and M step of the variational algorithm for rtm performs multiple iterations until they converge on estimates for the posterior and the parameters [6]. This is quite different from our EM algorithm: since our E step is exact, we update the parameters only once in each iteration. In our implementation, the convergence condition for the E step and for the entire EM algorithm is that the fractional increase of the log-likelihood between iterations falls below a fixed threshold; we also capped the number of iterations of the rtm algorithm due to its greater running time. In order to optimize the parameters (see the graphical model in Section 2.2), rtm uses a tunable regularization parameter ρ, which can be thought of as the number of observed non-links. We tried various settings for ρ, expressed as multiples of the number of observed links.
As described in Section 3.2, for pmtlm, pmtlm-dc, and phits-plsa we vary the relative weight α of the likelihood of the content vs. the links, tuning α to its best possible value. For the PubMed data set, we also normalized the content likelihood by the length of the documents.
4.3 Document Classification
4.3.1 Experimental Setting
For pmtlm, pmtlm-dc, and phits-plsa, we performed a large number of independent runs of the EM algorithm, each with random initial values of the parameters and topic mixtures. For each run we iterated the EM algorithm up to a fixed maximum number of iterations; we found that it typically converges in fewer, with the criterion that the fractional increase of the log-likelihood for two successive iterations falls below a fixed threshold. Figure 2 shows that the log-likelihood as a function of the number of iterations behaves quite similarly for all three data sets, even though these corpora have very different sizes. This indicates that even for large data sets, our algorithm converges within a small number of iterations, making its total running time linear in the size of the corpus.
For pmtlm and pmtlm-dc, we obtain discrete topic labels by running our EM algorithm and rounding the topic mixtures as described in Section 3.4. We also tested improving these labels with local search, using the Kernighan-Lin heuristic to change the label of one document at a time until we reach a local optimum of the likelihood. More precisely, of those runs, we took the fixed points of the EM algorithm with the highest likelihood and attempted to improve them further with the KL heuristic.
For rtm, in each E step we initialize the variational parameters randomly, and in each M step we initialize the hyperparameters randomly. We executed 500 independent runs for each setting of the tunable parameter ρ.
4.3.2 Metrics
For each algorithm, we used several measures of the accuracy of the inferred labels as compared to the human-curated ones. The Normalized Mutual Information (NMI) between two labelings t and s is defined as

(16)  NMI(t, s) = 2 I(t; s) / ( H(t) + H(s) ) .

Here I(t; s) is the mutual information between t and s, and H(t) and H(s) are the entropies of t and s respectively. Thus the NMI is a measure of how much information the inferred labels give us about the true ones. We also used the Pairwise F-measure (PWF) [3] and the Variation of Information (VI) [25] (which we wish to minimize).
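A self-contained sketch of this NMI, normalizing the mutual information by the mean of the two entropies (if a different normalization were intended, only the last line changes):

```python
import numpy as np
from collections import Counter

# Shannon entropy (in nats) of a discrete labeling.
def entropy(labels):
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return -(p * np.log(p)).sum()

# NMI between two labelings: mutual information divided by the mean entropy.
def nmi(t, s):
    t, s = np.asarray(t), np.asarray(s)
    mi = 0.0
    for a in np.unique(t):
        for b in np.unique(s):
            p_ab = np.mean((t == a) & (s == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(t == a) * np.mean(s == b)))
    return 2 * mi / (entropy(t) + entropy(s))
```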
Table 1: Statistics of the three corpora (top) and running times in seconds (bottom).

                              Cora   Citeseer     PubMed
  K  (topics)                    7          6          3
  N  (documents)             2,708      3,312     19,717
  M  (links)                 5,429      4,608     44,335
  W  (vocabulary size)       1,433      3,703      4,209
  R  (word-document pairs)  49,216    105,165  1,333,397

  Time (sec)
  EM (plsa)                     28         61        362
  EM (phits-plsa)               40         67        445
  EM (pmtlm)                    33         64        419
  EM (pmtlm-dc)                 36         64        402
  EM (rtm)                     992        597      2,194
  KL (pmtlm)                   375        618     13,723
  KL (pmtlm-dc)                421        565     13,014
Table 2: Document classification accuracy; the number in parentheses is the value of α at which each result was obtained.

                 |  Cora                              |  Citeseer                          |  PubMed
  Algorithm      |  NMI        VI         PWF         |  NMI        VI         PWF         |  NMI         VI          PWF
  phits-plsa     |  0.382 (.4) 2.285 (.4) 0.447 (.3)  |  0.366 (.5) 2.226 (.5) 0.480 (.5)  |  0.233 (1.0) 1.633 (1.0) 0.486 (1.0)
  link-lda       |  —          —          —           |  —          —          —           |  —           —           —
  cpl-dc         |  —          —          —           |  —          —          —           |  —           —           —
  rtm            |  0.349      2.306      0.422       |  0.369      2.209      0.480       |  0.228       1.646       0.482
  pmtlm          |  0.467 (.4) 1.957 (.4) 0.509 (.3)  |  0.399 (.4) 2.106 (.4) 0.509 (.3)  |  0.232 (.9)  1.639 (1.0) 0.486 (.9)
  pmtlm (kl)     |  0.514 (.4) 1.778 (.4) 0.525 (.4)  |  0.414 (.6) 2.057 (.6) 0.518 (.5)  |  0.233 (.9)  1.642 (.9)  0.488 (.9)
  pmtlm-dc       |  0.474 (.3) 1.930 (.3) 0.498 (.3)  |  0.402 (.3) 2.096 (.3) 0.518 (.3)  |  0.270 (.8)  1.556 (.8)  0.496 (.8)
  pmtlm-dc (kl)  |  0.491 (.3) 1.865 (.3) 0.511 (.3)  |  0.406 (.3) 2.084 (.3) 0.520 (.3)  |  0.260 (.8)  1.577 (.8)  0.492 (.8)
4.3.3 Results
The best NMI, VI, and PWF we observed for each algorithm are given in Table 2, where for link-lda and cpl-dc we quote results from [32]. For rtm, we give these metrics for the labeling with the highest likelihood, using the best value of ρ for each metric.
We see that even without the additional step of local search, our algorithm does very well, outperforming all other methods we tried on Citeseer and PubMed, and all but cpl-dc on Cora. (Note that we did not test link-lda or cpl-dc on PubMed.) Degree correction (pmtlm-dc) improves accuracy significantly for PubMed.
Refining our labeling with the KL heuristic improved the performance of our algorithm significantly for Cora and Citeseer, giving us a higher accuracy than all the other methods we tested. For PubMed, local search did not increase accuracy in a statistically significant way. In fact, on some runs it decreased the accuracy slightly compared to the initial labeling obtained from our EM algorithm; this is counterintuitive, but it shows that increasing the likelihood of a labeling in the model can decrease its accuracy.
In Figure 3, we show how the performance of pmtlm, pmtlm-dc, and phits-plsa varies as a function of α, the relative weight of content vs. links. Recall that at α = 0 these algorithms label documents solely on the basis of their links, while at α = 1 they only pay attention to the content. Each point is based on the top 20 runs with that value of α.
For Cora and Citeseer, there is an intermediate value of α at which pmtlm and pmtlm-dc have the best accuracy. However, this peak is fairly broad, showing that we do not have to tune α very carefully. For PubMed, where we also normalized the content information by document length, pmtlm-dc performs best at an intermediate value of α.
We give the running time of these algorithms, including pmtlm and pmtlm-dc with and without the KL heuristic, in Table 1, and compare it to the running time of the other algorithms we implemented. Our EM algorithm is much faster than the variational EM algorithm for rtm, and is scalable in that it grows linearly with the size of the corpus.
4.4 Link Prediction
Link prediction (e.g. [8, 23, 34]) is a natural generalization task in networks, and another way to measure the quality of our model and our EM algorithm. Based on a training set consisting of a subset of the links, our goal is to rank all pairs without an observed link according to the probability of a link between them. For our models, we rank pairs according to the expected number of links under the Poisson distribution, (2) and (3), which is monotonic in the probability that at least one link exists.
We can then predict links between those pairs where this probability exceeds some threshold. Since we are agnostic about this threshold and about the cost of Type I vs. Type II errors, we follow other work in this area by defining the accuracy of our model as the AUC, i.e., the probability that a random true positive link is ranked above a random true non-link. Equivalently, this is the area under the receiver operating characteristic (ROC) curve. Our goal is to do better than the baseline AUC of 1/2, corresponding to a random ranking of the pairs.

We carried out 10-fold cross-validation, in which the links in the original graph are partitioned into 10 subsets of equal size. For each fold, we use one subset as the test links, and train the model using the links in the other 9 folds. We evaluated the AUC on the held-out links and the non-links. For Cora and Citeseer, all the non-links are used. For PubMed, we randomly chose 10% of the non-links for comparison. We trained the models with the same settings as those for document classification in Section 4.3, executing many independent runs for each test. Note that unlike the document classification task, here we used the full topic mixtures θ_u to predict links, not just the discrete labels consisting of the most-likely topic for each document.
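The AUC of a score-based ranking can be computed directly from the scores of the held-out links and the sampled non-links; a sketch using expected link counts as in (2) and (3) as scores (function names are ours):

```python
import numpy as np

# AUC as used here: the probability that a random held-out link is ranked
# above a random non-link by the model's score, with ties counted as 1/2.
def auc(pos_scores, neg_scores):
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

# Expected number of links for given (u, v) pairs under the Poisson model;
# S is the optional degree-correction propensity vector.
def link_scores(theta, eta, pairs, S=None):
    u, v = np.asarray(pairs).T
    mean = ((theta[u] * theta[v]) * eta).sum(axis=1)
    if S is not None:
        mean *= S[u] * S[v]
    return mean
```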
Note that pmtlm-dc assigns S_u to be zero if the degree of u is zero. This makes it impossible for u to have any test link with other documents if its observed degree is zero in the training data. One way to solve this is to assign a small positive value to S_u even if u's degree is zero. Our approach assigns S_u the smallest value among those that are nonzero:

(17)  S_u = min_{v : S_v > 0} S_v .



Figure 4(a) gives the AUC values for pmtlm and pmtlm-dc as a function of α, the relative weight of content vs. links. The green horizontal line in each of those subplots represents the highest AUC value achieved by the rtm model for each data set, using the best value of ρ among those specified in Section 4.3. Interestingly, for Cora and Citeseer the optimal value of α is smaller than in Figure 3, showing that content is less important for link prediction than for document classification. We also plot the receiver operating characteristic (ROC) curves and precision-recall curves that achieve the highest AUC values in Figure 4(b) and Figure 4(c) respectively.
We see that, for all three data sets, our models outperform rtm, and that the degree-corrected model pmtlm-dc is significantly more accurate than the uncorrected one.
5 Conclusions
We have introduced a new generative model for document networks, a marriage between Probabilistic Latent Semantic Analysis [19] and the Ball-Karrer-Newman mixed-membership block model [2]. Because of its mathematical simplicity, its parameters can be inferred with a particularly simple and scalable EM algorithm. Our experiments on both document classification and link prediction show that it achieves high accuracy and efficiency for a variety of data sets, outperforming a number of other methods. In future work, we plan to test its performance on other tasks, including supervised and semi-supervised learning, active learning, and content prediction, i.e., predicting the presence or absence of words in a document based on its links to other documents and/or a subset of its text.
6 Acknowledgments
We are grateful to Brian Ball, Brian Karrer, Mark Newman and David M. Blei for helpful conversations. Y.Z., X.Y., and C.M. are supported by AFOSR and DARPA under grant FA95501210432.
References
 [1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. J. Machine Learning Research, 9:1981–2014, 2008.
 [2] B. Ball, B. Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Phys. Rev. E, 84:036103, 2011.
 [3] S. Basu. Semisupervised Clustering: Probabilistic Models, Algorithms and Experiments. PhD thesis, Department of Computer Sciences, University of Texas at Austin, 2005.
 [4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Machine Learning Research, 3:993–1022, 2003.
 [5] J. Chang and D. M. Blei. Relational topic models for document networks. Artificial Intelligence and Statistics, 2009.
 [6] J. Chang and D. M. Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1):124–150, Mar. 2010.
 [7] A. Chen, A. A. Amini, P. J. Bickel, and E. Levina. Fitting community models to large sparse networks. CoRR, abs/1207.2340, 2012.
 [8] A. Clauset, C. Moore, and M. E. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
 [9] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. 17th Intl. Conf. on Machine Learning, pages 167–174, 2000.
 [10] D. Cohn and T. Hofmann. The missing link—a probabilistic model of document content and hypertext connectivity. Proc. 13th Neural Information Processing Systems, 2001.
 [11] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84(6), 2011.
 [12] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107:065701, 2011.
 [13] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proc. National Academy of Sciences, 101 Suppl:5220–7, Apr. 2004.
 [14] S. E. Fienberg and S. Wasserman. Categorical data analysis of single sociometric relations. Sociological Methodology, pages 156–192, 1981.
 [15] L. Getoor and C. P. Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.
 [16] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. Journal of Machine Learning Research, 3:679–707, December 2002.
 [17] P. Gopalan, D. Mimno, S. Gerrish, M. Freedman, and D. Blei. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems 25, pages 2258–2266, 2012.
 [18] A. Gruber, M. Rosen-Zvi, and Y. Weiss. Latent topic models for hypertext. Proc. 24th Conf. on Uncertainty in Artificial Intelligence, 2008.
 [19] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM.
 [20] P. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
 [21] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, 2011.
 [22] M. Kim and J. Leskovec. Latent multigroup membership graph model. CoRR, abs/1205.4546, 2012.
 [23] L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.
 [24] Q. Lu and L. Getoor. Link-based classification using labeled and unlabeled data. ICML Workshop on "The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining", 2003.
 [25] M. Meilă. Comparing clusterings by the variation of information. Learning theory and kernel machines, pages 173–187, 2003.
 [26] C. Moore, X. Yan, Y. Zhu, J. Rouquier, and T. Lane. Active learning for node classification in assortative and disassortative networks. In Proc. 17th KDD, pages 841–849, 2011.
 [27] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for text and citations. Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, page 542, 2008.
 [28] M. E. J. Newman and E. A. Leicht. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences of the United States of America, 104(23):9564–9, 2007.
 [29] P. Sen, G. Namata, M. Bilgic, and L. Getoor. Collective classification in network data. AI Magazine, pages 1–24, 2008.
 [30] T. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.

 [31] C. Sun, B. Gao, Z. Cao, and H. Li. HTM: a topic model for hypertexts. In Proc. Conf. on Empirical Methods in Natural Language Processing, EMNLP '08, pages 514–522, 2008.
 [32] T. Yang, R. Jin, Y. Chi, and S. Zhu. A Bayesian framework for community detection integrating content and link. In Proc. 25th Conf. on Uncertainty in Artificial Intelligence, pages 615–622, 2009.
 [33] P. Yu, J. Han, and C. Faloutsos. Link Mining: Models, Algorithms, and Applications. Springer, 2010.
 [34] Y. Zhao, E. Levina, and J. Zhu. Link prediction for partially observed networks. arXiv preprint arXiv:1301.7047, 2013.
Appendix A Update Equations for PMTLM
In this appendix, we derive the update equations (10)–(12) for the parameters $\eta_z$, $\beta_{zw}$, and $\theta_{dz}$, giving the M step of our algorithm. Here $C_{dw}$ is the number of times word $w$ appears in document $d$, $n_d = \sum_w C_{dw}$ is the length of document $d$, $d_d = \sum_{d'} A_{dd'}$ is its degree, and $h_{dw}(z)$ and $\tau_{dd'}(z)$ are the E-step estimates of the probability that a given occurrence of word $w$ in document $d$, or a given link between $d$ and $d'$, is due to topic $z$.
Recall that the likelihood is given by (7) and (3.3). For identifiability, we impose the normalization constraints
(18) $\sum_w \beta_{zw} = 1$ for each topic $z$,
(19) $\sum_z \theta_{dz} = 1$ for each document $d$.
For each topic $z$, taking the derivative of the likelihood with respect to $\eta_z$ gives
(20) $\frac{\partial}{\partial \eta_z} = \frac{1}{2} \sum_{d \ne d'} \left[ \frac{A_{dd'}\,\tau_{dd'}(z)}{\eta_z} - \theta_{dz}\,\theta_{d'z} \right] = 0 .$
Thus
(21) $\eta_z = \frac{\sum_{d \ne d'} A_{dd'}\,\tau_{dd'}(z)}{\sum_{d \ne d'} \theta_{dz}\,\theta_{d'z}} .$
Plugging this in to (3.3) makes the last term a constant, namely the total number of links $M$. Thus we can ignore this term when estimating $\theta_{dz}$.
Similarly, for each topic $z$ and each word $w$, taking the derivative with respect to $\beta_{zw}$ gives
(22) $\frac{\partial}{\partial \beta_{zw}} = \frac{1}{\beta_{zw}} \sum_d C_{dw}\,h_{dw}(z) - \nu_z = 0 ,$
where $\nu_z$ is the Lagrange multiplier for (18). Normalizing determines $\nu_z$, and gives
(23) $\beta_{zw} = \frac{\sum_d C_{dw}\,h_{dw}(z)}{\sum_{w'} \sum_d C_{dw'}\,h_{dw'}(z)} .$
Finally, for each document $d$ and each topic $z$, taking the derivative with respect to $\theta_{dz}$ gives
(24) $\frac{\partial}{\partial \theta_{dz}} = \frac{1}{\theta_{dz}} \left[ \sum_w C_{dw}\,h_{dw}(z) + \sum_{d' \ne d} A_{dd'}\,\tau_{dd'}(z) \right] - \mu_d = 0 ,$
where $\mu_d$ is the Lagrange multiplier for (19). Normalizing determines $\mu_d = n_d + d_d$, and gives
(25) $\theta_{dz} = \frac{1}{n_d + d_d} \left[ \sum_w C_{dw}\,h_{dw}(z) + \sum_{d' \ne d} A_{dd'}\,\tau_{dd'}(z) \right] .$
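Each M-step update is a weighted count followed by a normalization, which is what makes the EM algorithm scalable. The sketch below (our own illustrative code with hypothetical variable names: C for word counts, A for the adjacency matrix; it forms dense responsibility arrays rather than exploiting sparsity as a practical implementation would) performs one EM iteration of the uncorrected updates (21), (23), and (25):

```python
import numpy as np

def em_step(C, A, theta, beta, eta):
    """One EM iteration for the uncorrected model (illustrative sketch).
    C: D x W word counts; A: D x D symmetric adjacency, zero diagonal;
    theta: D x K topic mixtures; beta: K x W topic-word distributions;
    eta: length-K per-topic link rates."""
    eps = 1e-12
    # E step: word responsibilities h[d, w, z] proportional to theta[d, z] * beta[z, w]
    h = theta[:, None, :] * beta.T[None, :, :]            # D x W x K
    h /= h.sum(axis=2, keepdims=True) + eps
    # E step: link responsibilities tau[d, d', z] proportional to eta[z]*theta[d, z]*theta[d', z]
    tau = eta[None, None, :] * theta[:, None, :] * theta[None, :, :]
    tau /= tau.sum(axis=2, keepdims=True) + eps
    Ch = C[:, :, None] * h                                # C_{dw} h_{dw}(z)
    At = A[:, :, None] * tau                              # A_{dd'} tau_{dd'}(z)
    # Update eta: links assigned to topic z over sum of mixture products
    denom = theta.sum(axis=0) ** 2 - (theta ** 2).sum(axis=0)
    eta_new = At.sum(axis=(0, 1)) / (denom + eps)
    # Update beta: word occurrences assigned to topic z, normalized over words
    beta_new = Ch.sum(axis=0).T                           # K x W
    beta_new /= beta_new.sum(axis=1, keepdims=True)
    # Update theta: word plus link counts assigned to topic z, normalized over topics
    theta_new = Ch.sum(axis=1) + At.sum(axis=1)           # D x K
    theta_new /= theta_new.sum(axis=1, keepdims=True)
    return theta_new, beta_new, eta_new
```

Note that because A has zero diagonal, the sums over At automatically exclude the d = d' terms, and normalizing theta_new directly implements the division by n_d + d_d.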
Appendix B Update Equations for the Degree-Corrected Model
Recall that in the degree-corrected model pmtlmdc, the number of links between each pair of documents $d, d'$ is Poisson-distributed with mean
(26) $\bar A_{dd'} = S_d S_{d'} \sum_z \eta_z\,\theta_{dz}\,\theta_{d'z} .$
To make the model identifiable, in addition to (18) and (19), we impose the following constraint on the degree-correction parameters $S_d$,
(27) $\sum_d S_d\,\theta_{dz} = 1$ for each topic $z$.
With this constraint, we have
(28) $\sum_{d,d'} \bar A_{dd'} = \sum_z \eta_z \left( \sum_d S_d\,\theta_{dz} \right)^2 = \sum_z \eta_z .$
The update equation (23) for $\beta_{zw}$ remains the same, since the degree-correction only affects the part of the model that generates the links, not the words. We now derive the update equations for $\eta_z$, $S_d$, and $\theta_{dz}$.
For each topic $z$, taking the derivative of the likelihood with respect to $\eta_z$ gives
(29) $\frac{\partial}{\partial \eta_z} = \frac{1}{2\eta_z} \sum_{d,d'} A_{dd'}\,\tau_{dd'}(z) - \frac{1}{2} \sum_{d,d'} S_d S_{d'}\,\theta_{dz}\,\theta_{d'z} = \frac{1}{2\eta_z} \sum_{d,d'} A_{dd'}\,\tau_{dd'}(z) - \frac{1}{2} = 0 ,$
where we used (27). Thus
(30) $\eta_z = \sum_{d,d'} A_{dd'}\,\tau_{dd'}(z) ,$
so $\eta_z$ is simply the expected number of links caused by topic $z$, with each link counted from both of its endpoints. In particular,
(31) $\sum_z \eta_z = \sum_{d,d'} A_{dd'} = 2M ,$
where $M$ is the total number of links.
For $S_d$, we have
(32) $\frac{\partial}{\partial S_d} = \frac{d_d}{S_d} - \sum_z (\eta_z + \lambda_z)\,\theta_{dz} = 0 ,$
where $\lambda_z$ is the Lagrange multiplier for (27). Thus
(33) $S_d = \frac{d_d}{\sum_z (\eta_z + \lambda_z)\,\theta_{dz}} .$
We will determine $\lambda_z$ below. However, note that multiplying both sides of (32) by $S_d$, summing over $d$, and applying (27) and (31) gives
(34) $\sum_z \lambda_z = 0 .$
Most importantly, for $\theta_{dz}$ we have
(35) $\frac{\partial}{\partial \theta_{dz}} = \frac{1}{\theta_{dz}} \left[ \sum_w C_{dw}\,h_{dw}(z) + \sum_{d'} A_{dd'}\,\tau_{dd'}(z) \right] - S_d \sum_{d'} S_{d'}\,\eta_z\,\theta_{d'z} - \lambda_z S_d - \mu_d = \frac{1}{\theta_{dz}} \left[ \sum_w C_{dw}\,h_{dw}(z) + \sum_{d'} A_{dd'}\,\tau_{dd'}(z) \right] - (\eta_z + \lambda_z)\,S_d - \mu_d = 0 ,$
where $\mu_d$ is the Lagrange multiplier for (19), and where we applied (27) in the second equality. Multiplying both sides of (35) by $\theta_{dz}$, summing over $z$, and applying (33) gives
(36) $\mu_d = n_d .$
Multiplying (35) by $\theta_{dz}$, summing over $d$, and applying (27), (30), and (36) gives
(37) $\lambda_z = \sum_d \left[ \sum_w C_{dw}\,h_{dw}(z) - n_d\,\theta_{dz} \right] .$
Thus $\lambda_z$ measures how the inferred topic distributions of the words differ from the topic mixtures $\theta_{dz}$.
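The identity (34), that the Lagrange multipliers of (37) sum to zero, gives a cheap numerical sanity check for an implementation: for any word responsibilities normalized over topics and any topic mixtures normalized over topics, the per-topic quantity "total word responsibility minus length-weighted mixture" must sum to zero across topics. A small illustrative check (our own code and variable names, not from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(1)
D, W, K = 6, 8, 3
C = rng.integers(0, 5, size=(D, W)).astype(float)   # word counts C_{dw}
h = rng.random((D, W, K))
h /= h.sum(axis=2, keepdims=True)                   # responsibilities sum to 1 over topics
theta = rng.random((D, K))
theta /= theta.sum(axis=1, keepdims=True)           # mixtures sum to 1 over topics
n = C.sum(axis=1)                                   # document lengths n_d
# Per-topic multiplier, following the form of (37):
# lam[z] = sum_d [ sum_w C_{dw} h_{dw}(z) - n_d theta_{dz} ]
lam = (C[:, :, None] * h).sum(axis=(0, 1)) - (n[:, None] * theta).sum(axis=0)
# Consistent with (34), the lam[z] sum to zero for any such h and theta.
```

The check holds identically, not just at the fixed point, because both the responsibilities and the mixtures distribute the same word totals across topics.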
Appendix C Update Equations with Dirichlet Prior
If we impose a Dirichlet prior on $\theta_d$, with parameters $\alpha_z$ for each topic $z$, this gives an additional term
(38) $\sum_d \sum_z (\alpha_z - 1) \log \theta_{dz}$
in the log-likelihood of both the pmtlm and pmtlmdc models. This is equivalent to introducing $\alpha_z - 1$ pseudocounts for each topic $z$, which we can think of as additional words or links that we know are due to topic $z$. Our original models, without this term, correspond to the uniform prior with $\alpha_z = 1$ and zero pseudocounts. However, as long as $\alpha_z \ge 1$, so that the pseudocounts are nonnegative, we can infer the parameters of our model in the same way with no loss of efficiency.
In the pmtlm model, (25) becomes
(39) $\theta_{dz} = \frac{1}{n_d + d_d + \sum_{z'} (\alpha_{z'} - 1)} \left[ \sum_w C_{dw}\,h_{dw}(z) + \sum_{d' \ne d} A_{dd'}\,\tau_{dd'}(z) + \alpha_z - 1 \right] .$
In the degree-corrected model pmtlmdc, (36) and (37) become
(40) $\mu_d = n_d + \sum_{z'} (\alpha_{z'} - 1)$
and
(41) $\lambda_z = \sum_d \left[ \sum_w C_{dw}\,h_{dw}(z) + \alpha_z - 1 - \left( n_d + \sum_{z'} (\alpha_{z'} - 1) \right) \theta_{dz} \right] .$
Note that $\lambda_z$ has two contributions. One measures, as before, how the inferred topic distributions of the words differ from the topic mixtures $\theta_{dz}$, and the other measures how the fraction of pseudocounts for topic $z$ differs from $\theta_{dz}$.
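In an implementation, the prior therefore changes only the final normalization of each document's mixture: add the pseudocounts to the per-topic counts, then renormalize. A minimal sketch of the modified update (39), with our own hypothetical helper name; word_link_counts stands for the per-topic word-plus-link counts of a single document:

```python
import numpy as np

def theta_update_with_prior(word_link_counts, alpha):
    """Update one document's mixture as in (39): add the pseudocounts
    alpha_z - 1 to the per-topic counts, then normalize over topics."""
    num = word_link_counts + (alpha - 1.0)
    return num / num.sum()

counts = np.array([4.0, 1.0, 0.0])   # counts assigned to each of K = 3 topics
flat = np.ones(3)                    # alpha_z = 1: uniform prior, no pseudocounts
# With the uniform prior the update reduces to plain normalization.
assert np.allclose(theta_update_with_prior(counts, flat), counts / counts.sum())
```

With $\alpha_z > 1$ the update smooths the mixture toward the prior, so no topic's weight is driven exactly to zero by sparse data.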