Scalable Text and Link Analysis with Mixed-Topic Link Models

Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. In order to perform inference on such data sets, and make predictions and recommendations, it is useful to have models that are able to capture the processes which generate the text at each node and the links between them. In this paper, we combine classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation, analyzing a data set with 1.3 million words and 44 thousand links in a few minutes.



1 Introduction

Many modern data sets contain not only rich information about each object, but also pairwise relationships between them, forming networks where each object is a node and links represent the relationships. In document networks, for example, each node is a document containing a sequence of words, and the links between nodes are citations or hyperlinks. Both the content of the documents and the topology of the links between them are meaningful.

Over the past few years, two disparate communities have been approaching these data sets from different points of view. In the data mining community, the goal has been to augment traditional approaches to learning and data mining by including relations between objects [15, 33]: for instance, to use the links between documents to help us label them by topic. In the network community, including its subset in statistical physics, the goal has been to augment traditional community structure algorithms such as the stochastic block model [14, 20, 30] by taking node attributes into account: for instance, to use the content of documents, rather than just the topological links between them, to help us understand their community structure.

In the original stochastic block model, each node has a discrete label, assigning it to one of $k$ communities. These labels, and the $k \times k$ matrix of probabilities with which a given pair of nodes with a given pair of labels have a link between them, can be inferred using Monte Carlo algorithms (e.g. [26]) or, more efficiently, with belief propagation [12, 11] or pseudolikelihood approaches [7]. However, in real networks communities often overlap, and a given node can belong to multiple communities. This led to the mixed-membership block model [1], where the goal is to infer, for each node $i$, a distribution or mixture of labels $\theta_i$ describing to what extent it belongs to each community. If we assume that links are assortative, i.e., that nodes are more likely to link to others in the same community, then the probability of a link between two nodes $i$ and $j$ depends on some measure of similarity (say, the inner product) of $\theta_i$ and $\theta_j$.

These mixed-membership block models fit nicely with classic ideas in topic modeling. In models such as Probabilistic Latent Semantic Analysis (plsa) [19] and Latent Dirichlet Allocation (lda) [4], each document $d$ has a mixture $\theta_d$ of topics. Each topic corresponds in turn to a probability distribution over words, and each word in $d$ is generated independently from the resulting mixture of distributions. If we think of $\theta_d$ as both the mixture of topics for generating words and the mixture of communities for generating links, then we can infer the mixtures $\theta_d$ jointly from the documents' content and the presence or absence of links between them.

There are many possible such models, and we are far from the first to think along these lines. Our innovation is to take as our starting point a particular mixed-membership block model recently developed in the statistical physics community [2], which we refer to as the bkn model. It differs from the mixed-membership stochastic block model (mmsb) of [1] in several ways:

  1. The bkn model treats the community membership mixtures $\theta_i$ directly as parameters to be inferred. In contrast, mmsb treats the $\theta_i$ as hidden variables generated by a Dirichlet distribution, and infers the hyperparameters of that distribution. The situation between plsa and lda is similar; plsa infers the topic mixtures $\theta_d$, while lda generates them from a Dirichlet distribution.

  2. The mmsb model generates each link according to a Bernoulli distribution, with an extra parameter for sparsity. Instead, bkn treats the links as a random multigraph, where the number of links $A_{ij}$ between each pair of nodes is Poisson-distributed. As a result, the derivatives of the log-likelihood with respect to $\theta_i$ and the other parameters are particularly simple.

These two factors make it possible to fit the bkn model using an efficient and exact expectation-maximization (EM) algorithm, making its inference highly scalable. The bkn model has another advantage as well:

  3. The bkn model is degree-corrected, in that it takes the observed degrees of the nodes into account when computing the expected number of edges between them. Thus it recognizes that two documents that have very different degrees might in fact have the same mix of topics; one may simply be more popular than the other.

In our work, we use a slight variant of the bkn model to generate the links, and we use plsa to generate the text. We present an EM algorithm for inferring the topic mixtures and other parameters. (While we do not impose a Dirichlet prior on the topic mixtures, it is easy to add a corresponding term to the update equations.) Our algorithm is scalable in the sense that each iteration takes $O(K(N+M+R))$ time for networks with $K$ topics, $N$ documents, and $M$ links, where $R$ is the sum over documents of the number of distinct words appearing in each one. In practice, our EM algorithm converges within a small number of iterations, making the total running time linear in the size of the corpus.

Our model can be used for a variety of learning and generalization tasks, including document classification or link prediction. For document classification, we can obtain hard labels for each document by taking its most-likely topic with respect to its topic mixture $\theta_d$, and optionally improve these labels further with local search. For link prediction, we train the model using a subset of the links, and then ask it to rank the remaining pairs of documents according to the probability of a link between them. For each task we determine the optimal relative weight of the content vs. the link information.

We performed experiments on three real-world data sets, with thousands of documents and millions of words. Our results show that our algorithm is more accurate, and considerably faster, than previous techniques for both document classification and link prediction.

The rest of the paper is organized as follows. Section 2 describes our generative model, and compares it with related models in the literature. Section 3 gives our EM algorithm and analyzes its running time. Section 4 contains our experimental results for document classification and link prediction, comparing our accuracy and running time with other techniques. In Section 5, we conclude, and offer some directions for further work.

2 Our Model and Previous Work

In this section, we give our proposed model, which we call the Poisson mixed-topic link model (pmtlm), and its degree-corrected variant, pmtlm-dc.

2.1 The Generative Model

Consider a network of $N$ documents. Each document $d$ has a fixed length $L_d$, and consists of a string of words $w_{d\ell}$ for $1 \le \ell \le L_d$, where $1 \le w_{d\ell} \le W$ and $W$ is the number of distinct words. In addition, each pair of documents $d, d'$ has an integer number of links connecting them, giving an adjacency matrix $A_{dd'}$. There are $K$ topics, which play the dual role of the overlapping communities in the network.

Our model generates both the content and the links as follows. We generate the content using the plsa model [19]. Each topic $z$ is associated with a probability distribution $\beta_z$ over words, and each document $d$ has a probability distribution $\theta_d$ over topics. For each document $d$ and each $1 \le \ell \le L_d$, we independently

  1. choose a topic $z \sim \mathrm{Multi}(\theta_d)$, and

  2. choose the word $w_{d\ell} \sim \mathrm{Multi}(\beta_z)$.

Thus the total probability that $w_{d\ell}$ is a given word $w$ is

$$\Pr[w_{d\ell} = w] = \sum_{z=1}^{K} \theta_{dz} \beta_{zw} . \tag{1}$$

We assume that the number of topics $K$ is fixed. The distributions $\beta_z$ and $\theta_d$ are parameters to be inferred.
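As a small illustration of the content model, the sketch below evaluates the mixture probability of a word for one document. The names `theta_d` (the document's topic mixture) and `beta` (one word distribution per topic), as well as the numbers, are assumptions chosen for the example rather than values from the experiments.

```python
def word_probability(theta_d, beta, w):
    """Probability that one word slot of a document holds word w.

    theta_d[z] is the document's weight on topic z; beta[z][w] is the
    probability of word w under topic z (names chosen for this sketch).
    """
    return sum(t * b[w] for t, b in zip(theta_d, beta))


theta_d = [0.7, 0.3]                       # hypothetical topic mixture
beta = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]  # hypothetical topic-word rows
probs = [word_probability(theta_d, beta, w) for w in range(3)]
```

Because each row of `beta` and the mixture `theta_d` sum to one, the resulting word probabilities also sum to one.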

We generate the links using a version of the Ball-Karrer-Newman (bkn) model [2]. Each topic $z$ is associated with a link density $\eta_z$. For each pair of documents $d, d'$ and each topic $z$, we independently generate a number of links which is Poisson-distributed with mean $\theta_{dz} \theta_{d'z} \eta_z$. Since the sum of independent Poisson variables is Poisson, the total number of links between $d$ and $d'$ is distributed as

$$A_{dd'} \sim \mathrm{Poi}\Big( \sum_{z} \theta_{dz} \theta_{d'z} \eta_z \Big) . \tag{2}$$

Since $A_{dd'}$ can exceed $1$, this gives a random multigraph. In the data sets we study below, $A_{dd'}$ is $1$ or $0$ depending on whether $d$ cites $d'$, giving a simple graph. On the other hand, in the sparse case the event that $A_{dd'} > 1$ has low probability in our model. Moreover, the fact that $A_{dd'}$ is Poisson-distributed rather than Bernoulli makes the derivatives of the likelihood with respect to the parameters $\theta_{dz}$ and $\eta_z$ very simple, allowing us to write down an efficient EM algorithm for inferring them.
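To see why multi-edges are rare in the sparse case, the following sketch computes the Poisson mean for one pair of documents and the probability of more than one link. It assumes the mean is the topic-wise product of the two mixtures weighted by a per-topic link density; the names `link_mean` and `poisson_pmf` and all numbers are hypothetical.

```python
import math


def link_mean(theta_d, theta_e, eta):
    """Expected number of links: sum over topics of theta_dz * theta_ez * eta_z."""
    return sum(a * b * h for a, b, h in zip(theta_d, theta_e, eta))


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical mixtures for two documents and small (sparse) link densities.
lam = link_mean([0.9, 0.1], [0.8, 0.2], [0.02, 0.05])
p_none = poisson_pmf(0, lam)
p_one = poisson_pmf(1, lam)
p_multi = 1.0 - p_none - p_one   # probability of two or more links
```

With a mean of about 0.015, the probability of two or more links is on the order of the mean squared, so treating a simple graph as a multigraph costs little here.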

This version of the model assumes that links are assortative, i.e., that links between documents only form to the extent that they belong to the same topic. One can easily generalize the model to include disassortative links as well, replacing the link densities $\eta_z$ with a matrix $\eta_{zz'}$ that allows documents with distinct topics to link [2].

We also consider degree-corrected versions of this model, where in addition to its topic mixture $\theta_d$, each document $d$ has a propensity $S_d$ of forming links. In that case,

$$A_{dd'} \sim \mathrm{Poi}\Big( S_d S_{d'} \sum_{z} \theta_{dz} \theta_{d'z} \eta_z \Big) . \tag{3}$$

We call this variant the Poisson Mixed-Topic Link Model with Degree Correction (pmtlm-dc).
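The effect of degree correction can be sketched numerically. The example assumes that each document's propensity multiplies the Poisson mean, so that two documents with the same topic mixture can still differ in how many links they attract; `dc_link_mean` and the numbers are hypothetical.

```python
def dc_link_mean(s_d, s_e, theta_d, theta_e, eta):
    """Expected links with degree correction: s_d * s_e * sum_z theta_dz * theta_ez * eta_z."""
    overlap = sum(a * b * h for a, b, h in zip(theta_d, theta_e, eta))
    return s_d * s_e * overlap


eta = [0.1, 0.1]   # hypothetical per-topic link densities
mix = [0.5, 0.5]   # the same topic mixture for every document in this example

popular = dc_link_mean(4.0, 1.0, mix, mix, eta)   # a high-propensity document
obscure = dc_link_mean(1.0, 1.0, mix, mix, eta)   # a low-propensity document
```

Both documents have identical mixtures, yet the expected number of links scales with the propensity, which is the behaviour degree correction is meant to capture.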

Figure 1: Graphical models for link generation: (a) link-lda, (b) c-pldc, (c) rtm, (d) pmtlm-dc.

2.2 Prior Work on Content–Link Models

Most models for document networks generate content using either plsa [19], as we do, or lda [4]. The distinction is that plsa treats the document mixtures $\theta_d$ as parameters, while in lda they are hidden variables, integrated over a Dirichlet distribution. As we show in Section 3, our approach gives a simple, exact EM algorithm, avoiding the need for sampling or variational methods. While we do not impose a Dirichlet prior on $\theta_d$ in this paper, it is easy to add a corresponding term to the update equations for the EM algorithm, with no loss of efficiency.

There are a variety of methods in the literature to generate links between documents. phits-plsa [10], link-lda [13] and link-plsa-lda [27] use the phits [9] model for link generation. phits treats each document as an additional term in the vocabulary, so two documents are similar if they link to the same documents. This is analogous to a mixture model for networks studied in [28]. In contrast, block models like ours treat documents as similar if they link to similar documents, as opposed to literally the same ones.

The pairwise link-lda model [27], like ours, generates the links with a mixed-topic block model, although as in mmsb [1] and lda [4] it treats the topic mixtures $\theta_d$ as hidden variables integrated over a Dirichlet prior. It is fit with a variational method that requires a large number of variational parameters, making it less scalable than our approach.

In the c-pldc model [32], the link probability from $d$ to $d'$ is determined by their topic mixtures and the popularity of $d'$, which is drawn from a Gamma distribution with two hyperparameters. Thus this popularity plays a role similar to the degree-correcting parameter $S_{d'}$ in our model, although we correct for the degree of $d$ as well. However, c-pldc does not generate the content, but takes it as given.

The Relational Topic Model (rtm) [5, 6] assumes that the link probability between $d$ and $d'$ depends on the topics of the words appearing in their text. In contrast, our model uses the underlying topic mixtures $\theta_d$ to generate both the content and the links. Like our model, rtm defines the similarity of two documents as a weighted inner product of their topic mixtures; however, in rtm the probability of a link is a nonlinear function of this similarity, which can be logistic, exponential or normal.

Although it deals with a slightly different kind of dataset, our model is closest in spirit to the Latent Topic Hypertext Model (lthm) [18]. This is a generative model for hypertext networks, where each link from $d$ to $d'$ is associated with a specific word $w$ in $d$. If we sum over all words in $d$, the total number of links $A_{dd'}$ from $d$ to $d'$ that lthm would generate follows a binomial distribution

$$A_{dd'} \sim \mathrm{Bin}\Big( L_d ,\; \lambda_{d'} \sum_{z} \theta_{dz} \theta_{d'z} \Big) , \tag{4}$$

where $\lambda_{d'}$ is, in our terms, a degree-correction parameter. When $L_d$ is large this becomes a Poisson distribution with mean $L_d \lambda_{d'} \sum_z \theta_{dz} \theta_{d'z}$. Our model differs from this in two ways: our parameters $\eta_z$ give a link density associated with each topic $z$, and our degree correction does not assume that the number of links from $d$ is proportional to its length.

We briefly mention several other approaches. The authors of [16] extend the probabilistic relational model (prm) framework and propose a unified generative model for both content and links in a relational structure. In [24], the authors propose a link-based model that describes both node attributes and links. The htm model [31] treats links as fixed rather than generating them, and only generates the text. Finally, the lmmg model [22] treats the appearance or absence of a word as a binary attribute of each document, and uses a logistic or exponential function of these attributes to determine the link probabilities.

In Section 4 below, we compare our model to phits-plsa, link-lda, c-pldc, and rtm. Graphical models for the link generation components of these models, and ours, are shown in Figure 1.

3 A Scalable EM Algorithm

Here we describe an efficient Expectation-Maximization algorithm to find the maximum-likelihood estimates of the parameters of our model. Each update takes $O(K(N+M+R))$ time for a document network with $K$ topics, $N$ documents, and $M$ links, where $R$ is the sum over the documents of the number of distinct words in each one. Thus the running time per iteration is linear in the size of the corpus.

For simplicity we describe the algorithm for the simpler version of our model, pmtlm. The algorithm for the degree-corrected version, pmtlm-dc, is similar.

3.1 The likelihood

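Let $C_{dw}$ denote the number of times a word $w$ appears in document $d$. From (1), the log-likelihood of $d$'s content is

$$\mathcal{L}_d^{\text{content}} = \sum_{w} C_{dw} \log \Big( \sum_{z} \theta_{dz} \beta_{zw} \Big) . \tag{5}$$

Similarly, from (2), the log-likelihood for the links is

$$\mathcal{L}^{\text{links}} = \frac{1}{2} \sum_{d,d'} \Big[ A_{dd'} \log \Big( \sum_{z} \theta_{dz} \theta_{d'z} \eta_z \Big) - \sum_{z} \theta_{dz} \theta_{d'z} \eta_z \Big] . \tag{6}$$

We ignore the constant term $-\log A_{dd'}!$ from the denominator of the Poisson distribution, since it has no bearing on the parameters.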
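The two log-likelihoods can be evaluated directly, as in the sketch below. It assumes a symmetric adjacency matrix with zero diagonal and sums over unordered pairs of documents, dropping the constant factorial term; the dense double loop is for clarity only and does not reflect the sparse bookkeeping an efficient implementation would use. All names and numbers are hypothetical.

```python
import math


def content_loglik(counts_d, theta_d, beta):
    """Content log-likelihood of one document; counts_d maps word index -> count."""
    total = 0.0
    for w, c in counts_d.items():
        p = sum(t * b[w] for t, b in zip(theta_d, beta))
        total += c * math.log(p)
    return total


def link_loglik(adj, theta, eta):
    """Poisson link log-likelihood over unordered pairs d < e (constant term dropped)."""
    n = len(theta)
    total = 0.0
    for d in range(n):
        for e in range(d + 1, n):
            lam = sum(a * b * h for a, b, h in zip(theta[d], theta[e], eta))
            total -= lam                       # Poisson mean is subtracted for every pair
            if adj[d][e] > 0:                  # log term only for observed links
                total += adj[d][e] * math.log(lam)
    return total


beta = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
ll_content = content_loglik({0: 2, 2: 1}, [0.5, 0.5], beta)

theta = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
ll_links = link_loglik(adj, theta, [0.5, 0.5])
```

Note that an observed link between two documents whose mixtures share no topic would give a mean of zero and an undefined logarithm; the sketch does not guard against that case.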

3.2 Balancing Content and Links

While we can use the total likelihood directly, in practice we can improve our performance significantly by better balancing the information in the content vs. that in the links. In particular, the content log-likelihood $\mathcal{L}_d^{\text{content}}$ of each document is proportional to its length, while its contribution to the link log-likelihood $\mathcal{L}^{\text{links}}$ is proportional to its degree. Since a typical document has many more words than links, the content term tends to be much larger than the link term.

Following [19], we can provide this balance in two ways. One is to normalize $\mathcal{L}_d^{\text{content}}$ by the length $L_d$, and another is to add a parameter $\alpha$ that reweights the relative contributions of the two terms. We then maximize the function

$$\mathcal{L} = \alpha \sum_{d} \frac{1}{L_d} \mathcal{L}_d^{\text{content}} + (1-\alpha) \, \mathcal{L}^{\text{links}} . \tag{7}$$

Varying $\alpha$ from $0$ to $1$ lets us interpolate between two extremes: studying the document network purely in terms of its topology, or purely in terms of the documents' content. Indeed, we will see in Section 4 that the optimal value of $\alpha$ depends on which task we are performing, and differs between link prediction and topic classification.

3.3 Update Equations and Running Time

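We maximize $\mathcal{L}$ as a function of the parameters $\theta$, $\beta$, and $\eta$ using an EM algorithm, very similar to the one introduced by [2] for overlapping community detection. We start with a standard trick to change the log of a sum into a sum of logs, writing

$$\sum_{w} C_{dw} \log \sum_{z} \theta_{dz} \beta_{zw} \;\ge\; \sum_{w} C_{dw} \sum_{z} h_{dw}(z) \log \frac{\theta_{dz} \beta_{zw}}{h_{dw}(z)} ,$$

$$\frac{1}{2} \sum_{d,d'} A_{dd'} \log \sum_{z} \theta_{dz} \theta_{d'z} \eta_z \;\ge\; \frac{1}{2} \sum_{d,d'} A_{dd'} \sum_{z} q_{dd'}(z) \log \frac{\theta_{dz} \theta_{d'z} \eta_z}{q_{dd'}(z)} . \tag{8}$$

Here $h_{dw}(z)$ is the probability that a given appearance of $w$ in $d$ is due to topic $z$, and $q_{dd'}(z)$ is the probability that a given link between $d$ and $d'$ is due to topic $z$. This lower bound holds with equality when

$$h_{dw}(z) = \frac{\theta_{dz} \beta_{zw}}{\sum_{z'} \theta_{dz'} \beta_{z'w}} , \qquad q_{dd'}(z) = \frac{\theta_{dz} \theta_{d'z} \eta_z}{\sum_{z'} \theta_{dz'} \theta_{d'z'} \eta_{z'}} , \tag{9}$$

giving us the E step of the algorithm.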

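For the M step, we derive update equations for the parameters $\theta$, $\beta$, and $\eta$. By taking derivatives of the log-likelihood (7) (see Appendix A for details) we obtain

$$\eta_z = \frac{\sum_{d,d'} A_{dd'} \, q_{dd'}(z)}{\big( \sum_{d} \theta_{dz} \big)^2} , \tag{10}$$

$$\beta_{zw} = \frac{\sum_{d} (1/L_d) \, C_{dw} \, h_{dw}(z)}{\sum_{d} (1/L_d) \sum_{w'} C_{dw'} \, h_{dw'}(z)} , \tag{11}$$

$$\theta_{dz} = \frac{(\alpha / L_d) \sum_{w} C_{dw} \, h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} \, q_{dd'}(z)}{\alpha + (1-\alpha) \, \kappa_d} . \tag{12}$$

Here $\kappa_d = \sum_{d'} A_{dd'}$ is the degree of document $d$.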
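One full iteration can be sketched in a few dozen lines. The sketch assumes specific forms for the steps: E-step responsibilities proportional to the product of the current parameters, an M step that normalizes weighted responsibility sums, content counts normalized by document length, and a link-density update that uses the current topic mixtures. These forms, the function name `em_step`, and the toy data are assumptions made for illustration; the sketch covers only the version without degree correction.

```python
def em_step(counts, links, theta, beta, eta, alpha):
    """One EM iteration (sketch, no degree correction).

    counts[d]: dict word -> count; links: list of undirected (d, e) pairs;
    theta[d][z], beta[z][w], eta[z]: current parameters; alpha: content weight.
    """
    N, K, W = len(theta), len(eta), len(beta[0])
    length = [sum(c.values()) for c in counts]
    degree = [0] * N
    theta_num = [[0.0] * K for _ in range(N)]
    beta_num = [[0.0] * W for _ in range(K)]
    eta_num = [0.0] * K

    # Content: responsibilities h_dw(z) proportional to theta_dz * beta_zw.
    for d in range(N):
        for w, c in counts[d].items():
            h = [theta[d][z] * beta[z][w] for z in range(K)]
            s = sum(h)
            for z in range(K):
                r = c * h[z] / s / length[d]   # length-normalized expected count
                theta_num[d][z] += alpha * r
                beta_num[z][w] += r

    # Links: responsibilities q_de(z) proportional to theta_dz * theta_ez * eta_z.
    for d, e in links:
        q = [theta[d][z] * theta[e][z] * eta[z] for z in range(K)]
        s = sum(q)
        degree[d] += 1
        degree[e] += 1
        for z in range(K):
            r = q[z] / s
            theta_num[d][z] += (1 - alpha) * r
            theta_num[e][z] += (1 - alpha) * r
            eta_num[z] += 2 * r                # counts both orderings (d, e) and (e, d)

    # M step: normalize the accumulated sums (eta uses the current theta).
    col = [sum(theta[d][z] for d in range(N)) for z in range(K)]
    new_eta = [eta_num[z] / col[z] ** 2 for z in range(K)]
    new_beta = [[x / sum(row) for x in row] for row in beta_num]
    new_theta = [[x / (alpha + (1 - alpha) * degree[d]) for x in theta_num[d]]
                 for d in range(N)]
    return new_theta, new_beta, new_eta


# Toy corpus: four documents, three words, two links, two topics.
counts = [{0: 2, 1: 1}, {0: 1, 1: 1}, {2: 3}, {1: 1, 2: 2}]
links = [(0, 1), (2, 3)]
theta = [[0.6, 0.4], [0.55, 0.45], [0.4, 0.6], [0.45, 0.55]]
beta = [[0.4, 0.35, 0.25], [0.25, 0.35, 0.4]]
eta = [0.5, 0.5]
theta, beta, eta = em_step(counts, links, theta, beta, eta, alpha=0.5)
```

The loops touch only the observed word counts and the observed links, which is where the linear cost per iteration comes from. The sketch assumes every document has at least one word and that all parameters start strictly positive, so no division by zero occurs.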

To analyze the running time, let $R_d$ denote the number of distinct words in document $d$, and let $R = \sum_d R_d$. Then only $KR$ of the parameters $h_{dw}(z)$ are nonzero. Similarly, $q_{dd'}(z)$ only appears if $A_{dd'} \ne 0$, so in a network with $M$ links only $KM$ of the $q_{dd'}(z)$ are nonzero. The total number of nonzero terms appearing in (9)–(12), and hence the running time of the E and M steps, is thus $O(K(N+M+R))$.

As in [2], we can speed up the algorithm if $\theta$ is sparse, i.e. if many documents belong to fewer than $K$ topics, so that many of the $\theta_{dz}$ are zero. According to (9), if $\theta_{dz} = 0$ then $h_{dw}(z) = q_{dd'}(z) = 0$, in which case (12) implies that $\theta_{dz} = 0$ for all future iterations. If we choose a threshold below which $\theta_{dz}$ is effectively zero, then as $\theta$ becomes sparser we can maintain just those $h_{dw}(z)$ and $q_{dd'}(z)$ where $\theta_{dz} \ne 0$. This in turn simplifies the updates for $\eta$ and $\beta$ in (10) and (11).

We note that the simplicity of our update equations comes from the fact that the number of links $A_{dd'}$ is Poisson, and that its mean is a multilinear function of the parameters. Models where $A_{dd'}$ is Bernoulli-distributed with a more complicated link probability, such as a logistic function, have more complicated derivatives of the likelihood, and therefore more complicated update equations.

Note also that this EM algorithm is exact, in the sense that the maximum-likelihood estimators are fixed points of the update equations. This is because the E step (9) is exact, since the conditional distribution of topics associated with each word occurrence and each link is a product distribution, which we can describe exactly with the responsibilities $h_{dw}(z)$ and $q_{dd'}(z)$. (There are typically multiple fixed points, so in practice we run our algorithm with many different initial conditions, and take the fixed point with the highest likelihood.)

This exactness is due to the fact that the topic mixtures $\theta_d$ are parameters to be inferred. In models such as lda and mmsb where $\theta_d$ is a hidden variable integrated over a Dirichlet prior, the topics associated with each word and link have a complicated joint distribution that can only be approximated using sampling or variational methods. (To be fair, recent advances such as stochastic optimization based on network subsampling [17] have shown that approximate inference in these models can be carried out quite efficiently.)

On the other hand, in the context of finding communities in networks, models with Dirichlet priors have been observed to generalize more successfully than Poisson models such as bkn [17]. Happily, we can impose a Dirichlet prior on $\theta_d$ with no loss of efficiency, simply by including pseudocounts in the update equations, in essence adding additional words and links that are known to come from each topic (see Appendix C). This lets us obtain a maximum a posteriori (MAP) estimate of an lda-like model. We leave this as a direction for future work.

3.4 Discrete Labels and Local Search

Our model, like plsa and the bkn model, lets us infer a soft classification: a mixture of topic labels or community memberships for each document. However, we often want to infer categorical labels, where each document $d$ is assigned to a single topic $t_d$. A natural way to do this is to let $t_d$ be the most-likely label in the inferred mixture, $t_d = \arg\max_z \theta_{dz}$. This is equivalent to rounding $\theta_d$ to a delta function, $\theta_{dz} = 1$ for $z = t_d$ and $\theta_{dz} = 0$ for $z \ne t_d$.
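Rounding a soft classification to hard labels is a one-line operation. The sketch below picks the most-likely topic for each document and shows the corresponding delta-function mixture; the helper names and the example mixtures are hypothetical, and ties are broken in favour of the lowest topic index, which is a choice made here rather than one stated in the text.

```python
def hard_labels(theta):
    """Most-likely topic per document; ties go to the lowest topic index."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def one_hot(label, k):
    """Delta-function mixture that puts all weight on a single topic."""
    return [1.0 if z == label else 0.0 for z in range(k)]


theta = [[0.2, 0.7, 0.1], [0.5, 0.25, 0.25]]   # hypothetical inferred mixtures
labels = hard_labels(theta)
rounded = [one_hot(t, 3) for t in labels]
```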

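If we wish, we can improve these discrete labels further using local search. If each document $d$ has just a single topic $t_d$, the log-likelihood of our model is

$$\mathcal{L}_d^{\text{content}} = \sum_{w} C_{dw} \log \beta_{t_d w} , \tag{13}$$

$$\mathcal{L}^{\text{links}} = \frac{1}{2} \sum_{d,d'} \Big[ A_{dd'} \log \eta_{t_d t_{d'}} - \eta_{t_d t_{d'}} \Big] . \tag{14}$$

Note that here $\eta$ is a matrix, with off-diagonal entries that allow documents with different topics to be linked. Otherwise, these discrete labels would cause the network to split into separate components.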

Let $N_z$ denote the number of documents of topic $z$, let $L_z$ be their total length, and let $C_{zw}$ be the total number of times $w$ appears in them. Let $m_{zz'}$ denote the total number of links between documents of topics $z$ and $z'$, counting each link twice if $z = z'$. Then the MLEs for $\beta$ and $\eta$ are

$$\beta_{zw} = \frac{C_{zw}}{L_z} , \qquad \eta_{zz'} = \frac{m_{zz'}}{N_z N_{z'}} . \tag{15}$$

Applying these MLEs in (13) and (14) gives us a point estimate of the likelihood of a discrete topic assignment, which we can normalize or reweight as discussed in Section 3.2 if we like. We can then maximize this likelihood using local search: for instance, using the Kernighan-Lin heuristic as in [21] or a Monte Carlo algorithm to find a local maximum of the likelihood in the vicinity of the rounded labeling. Each step of these algorithms changes the label of a single document $d$, so we can update the values of $N_z$, $L_z$, $C_{zw}$, and $m_{zz'}$ and compute the new likelihood efficiently. In our experiments we used the KL heuristic, and found that for some data sets it noticeably improved the accuracy of our algorithm for the document classification task.
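The counts that a local search maintains, and the closed-form estimates they give, can be sketched as follows. The sketch assumes the word distribution of a topic is the fraction of word occurrences in its documents, and the link density between two topics is the link count divided by the number of document pairs; the function name, argument layout, and toy data are hypothetical, and every topic is assumed to contain at least one document with at least one word.

```python
def discrete_mles(labels, counts, links, K, W):
    """Per-topic counts and closed-form estimates for a hard labeling (sketch)."""
    n_docs = [0] * K                      # documents per topic
    length = [0] * K                      # total words per topic
    word = [[0] * W for _ in range(K)]    # occurrences of each word per topic
    m = [[0] * K for _ in range(K)]       # links between each pair of topics
    for t, c in zip(labels, counts):
        n_docs[t] += 1
        for w, n in c.items():
            length[t] += n
            word[t][w] += n
    for d, e in links:
        # Incrementing both entries counts a same-topic link twice on the diagonal.
        m[labels[d]][labels[e]] += 1
        m[labels[e]][labels[d]] += 1
    beta = [[word[z][w] / length[z] for w in range(W)] for z in range(K)]
    eta = [[m[z][y] / (n_docs[z] * n_docs[y]) for y in range(K)] for z in range(K)]
    return beta, eta


labels = [0, 0, 1]                                # hypothetical hard labels
counts = [{0: 2}, {0: 1, 1: 1}, {1: 4}]           # word counts per document
links = [(0, 1), (1, 2)]                          # undirected links
beta, eta = discrete_mles(labels, counts, links, K=2, W=2)
```

When a single document changes label, only its own word counts and incident links move between topics, so these tables can be updated in place rather than recomputed.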

4 Experimental Results

In this section we present empirical results on our model and our algorithm for unsupervised document classification and link prediction. We compare its accuracy and running time with those of several other methods, testing it on three real-world document citation networks.

4.1 Data Sets

The top portion of Table 1 lists the basic statistics for three real-world corpora [29]: Cora, Citeseer, and PubMed. These data sets are available for download. Cora and Citeseer contain papers in machine learning, with $K = 7$ topics for Cora and $K = 6$ for Citeseer. PubMed consists of medical research papers on $K = 3$ topics, namely three types of diabetes. All three corpora have ground-truth topic labels provided by human curators.

The data sets for these corpora are slightly different. The PubMed data set has the number of times each word appeared in each document, while the data for Cora and Citeseer records whether or not a word occurred at least once in the document. For Cora and Citeseer, we treat the word count $C_{dw}$ as being $0$ or $1$.

4.2 Models and Implementations

We compare the Poisson Mixed-Topic Link Model (pmtlm) and its degree-corrected variant, denoted pmtlm-dc, with phits-plsa, link-lda, c-pldc, and rtm (see Section 2.2). We used our own implementation of both phits-plsa and rtm. For rtm, we implemented the variational EM algorithm given in [6]. The implementation is based on the lda code available from the authors. We also tried the code provided by J. Chang, which uses a Monte Carlo algorithm for the E step, but we found the variational algorithm works better on our data sets. While rtm includes a variety of link probability functions, we only used the sigmoid function. We also assume a symmetric Dirichlet prior. The results for link-lda and c-pldc are taken from [32].

Each E and M step of the variational algorithm for rtm performs multiple iterations until they converge on estimates for the posterior and the parameters [6]. This is quite different from our EM algorithm: since our E step is exact, we update the parameters only once in each iteration. In our implementation, the convergence condition for the E step and for the entire EM algorithm is that the fractional increase of the log-likelihood between iterations falls below a small threshold; we performed a maximum of 500 iterations of the rtm algorithm due to its greater running time. In order to optimize its link parameters (see the graphical model in Section 2.2), rtm uses a tunable regularization parameter $\rho$, which can be thought of as the number of observed non-links. We tried various settings for $\rho$, chosen relative to $M$, the number of observed links.

As described in Section 3.2, for pmtlm, pmtlm-dc and phits-plsa we vary the relative weight $\alpha$ of the likelihood of the content vs. the links, tuning $\alpha$ to its best possible value. For the PubMed data set, we also normalized the content likelihood by the length of the documents.

4.3 Document Classification

4.3.1 Experimental Setting

For pmtlm, pmtlm-dc and phits-plsa, we performed multiple independent runs of the EM algorithm, each with random initial values of the parameters and topic mixtures. For each run we iterated the EM algorithm up to a fixed maximum number of times; we found that it typically converges in fewer iterations, with the criterion that the fractional increase of the log-likelihood for two successive iterations falls below a small threshold. Figure 2 shows that the log-likelihood curves as a function of the number of iterations are quite similar for all three data sets, even though these corpora have very different sizes. This indicates that even for large data sets, our algorithm converges within a small number of iterations, making its total running time linear in the size of the corpus.

Figure 2: The log-likelihood of the PMTLM and PMTLM-DC models as a function of the number of EM iterations, normalized so that $0$ and $1$ are the initial and final log-likelihood respectively. The convergence is roughly the same for all three data sets, showing that the number of iterations is roughly constant as a function of the size of the corpus.

For pmtlm and pmtlm-dc, we obtain discrete topic labels by running our EM algorithm and rounding the topic mixtures as described in Section 3.4. We also tested improving these labels with local search, using the Kernighan-Lin heuristic to change the label of one document at a time until we reach a local optimum of the likelihood. More precisely, of those runs, we took a fixed number of the best fixed points of the EM algorithm (i.e., those with the highest likelihood) and attempted to improve them further with the KL heuristic. The number of fixed points refined in this way was chosen separately for Cora and Citeseer and for PubMed.

For rtm, in each E step, we initialize the variational parameters randomly, and in each M step we initialize the hyperparameters randomly. We execute 500 independent runs for each setting of the tunable regularization parameter $\rho$.

4.3.2 Metrics

For each algorithm, we used several measures of the accuracy of the inferred labels as compared to the human-curated ones. The Normalized Mutual Information (NMI) between two labelings $C_1$ and $C_2$ is defined as the mutual information $I(C_1; C_2)$ between $C_1$ and $C_2$, normalized using the entropies $H(C_1)$ and $H(C_2)$ of $C_1$ and $C_2$ respectively. Thus the NMI is a measure of how much information the inferred labels give us about the true ones. We also used the Pairwise F-measure (PWF) [3] and the Variation of Information (VI) [25] (which we wish to minimize).
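For concreteness, the two information-theoretic metrics can be computed from a pair of label lists as sketched below. The variation of information follows its standard definition. For the NMI, the sketch assumes normalization by the mean of the two entropies, which is one common convention and may differ from the normalization used in the experiments; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def nmi(a, b):
    # Assumed normalization: arithmetic mean of the two entropies.
    return 2 * mutual_information(a, b) / (entropy(a) + entropy(b))


def variation_of_information(a, b):
    return entropy(a) + entropy(b) - 2 * mutual_information(a, b)


truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
nmi_same = nmi(truth, same_up_to_renaming)
nmi_unrelated = nmi(truth, unrelated)
vi_unrelated = variation_of_information(truth, unrelated)
```

Both metrics ignore the names of the labels, so a labeling that matches the ground truth up to a permutation of topics scores perfectly. The NMI sketch is undefined when both labelings put every item in a single class.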

| | | Cora | Citeseer | PubMed |
|---|---|---|---|---|
| Statistics | $K$ | 7 | 6 | 3 |
| | $N$ | 2,708 | 3,312 | 19,717 |
| | $M$ | 5,429 | 4,608 | 44,335 |
| | $W$ | 1,433 | 3,703 | 4,209 |
| | $R$ | 49,216 | 105,165 | 1,333,397 |
| Time (sec) | EM (plsa) | 28 | 61 | 362 |
| | EM (phits-plsa) | 40 | 67 | 445 |
| | EM (pmtlm) | 33 | 64 | 419 |
| | EM (pmtlm-dc) | 36 | 64 | 402 |
| | EM (rtm) | 992 | 597 | 2,194 |
| | KL (pmtlm) | 375 | 618 | 13,723 |
| | KL (pmtlm-dc) | 421 | 565 | 13,014 |

Table 1: The statistics of the three data sets, and the mean running time, for the EM algorithms in our model PMTLM, its degree-corrected variant PMTLM-DC, and PLSA, PHITS-PLSA, and RTM. Each corpus has $K$ topics, $N$ documents, $M$ links, a vocabulary of size $W$, and a total size $R$. Running times for our algorithm, PLSA, and PHITS-PLSA are given for a single run of the EM algorithm. Running times for RTM consist of up to 500 EM iterations, or until the convergence criteria are reached. Our EM algorithm is highly scalable, with a running time that grows linearly with the size of the corpus. In particular, it is much faster than the variational algorithm for RTM. Improving discrete labels with the Kernighan-Lin heuristic (KL) increases our algorithm's running time, but improves its accuracy for document classification in Cora and Citeseer.
| Algorithm | Cora NMI | Cora VI | Cora PWF | Citeseer NMI | Citeseer VI | Citeseer PWF | PubMed NMI | PubMed VI | PubMed PWF |
|---|---|---|---|---|---|---|---|---|---|
| phits-plsa | 0.382 (.4) | 2.285 (.4) | 0.447 (.3) | 0.366 (.5) | 2.226 (.5) | 0.480 (.5) | 0.233 (1.0) | 1.633 (1.0) | 0.486 (1.0) |
| rtm | 0.349 | 2.306 | 0.422 | 0.369 | 2.209 | 0.480 | 0.228 | 1.646 | 0.482 |
| pmtlm | 0.467 (.4) | 1.957 (.4) | 0.509 (.3) | 0.399 (.4) | 2.106 (.4) | 0.509 (.3) | 0.232 (.9) | 1.639 (1.0) | 0.486 (.9) |
| pmtlm (kl) | 0.514 (.4) | 1.778 (.4) | 0.525 (.4) | 0.414 (.6) | 2.057 (.6) | 0.518 (.5) | 0.233 (.9) | 1.642 (.9) | 0.488 (.9) |
| pmtlm-dc | 0.474 (.3) | 1.930 (.3) | 0.498 (.3) | 0.402 (.3) | 2.096 (.3) | 0.518 (.3) | 0.270 (.8) | 1.556 (.8) | 0.496 (.8) |
| pmtlm-dc (kl) | 0.491 (.3) | 1.865 (.3) | 0.511 (.3) | 0.406 (.3) | 2.084 (.3) | 0.520 (.3) | 0.260 (.8) | 1.577 (.8) | 0.492 (.8) |

Table 2: The best normalized mutual information (NMI), variation of information (VI) and pairwise F-measure (PWF) achieved by each algorithm. Some values are quoted from [32]; other values are based on our implementation. Note that we seek to maximize NMI and PWF, and minimize VI. For PHITS-PLSA, PMTLM, and PMTLM-DC, the number in parentheses is the best value of the relative weight $\alpha$ of content vs. links. Refining the labeling returned by the EM algorithm with the Kernighan-Lin heuristic is indicated by (KL).

4.3.3 Results

The best NMI, VI, and PWF we observed for each algorithm are given in Table 2, where for link-lda and c-pldc we quote results from [32]. For rtm, we give these metrics for the labeling with the highest likelihood, using the best value of the regularization parameter $\rho$ for each metric.

We see that even without the additional step of local search, our algorithm does very well, outperforming all other methods we tried on Citeseer and PubMed and all but c-pldc on Cora. (Note that we did not test link-lda or c-pldc on PubMed.) Degree correction (pmtlm-dc) improves accuracy significantly for PubMed.

Refining our labeling with the KL heuristic improved the performance of our algorithm significantly for Cora and Citeseer, giving us a higher accuracy than all the other methods we tested. For PubMed, local search did not increase accuracy in a statistically significant way. In fact, on some runs it decreased the accuracy slightly compared to the initial labeling obtained from our EM algorithm; this is counterintuitive, but it shows that increasing the likelihood of a labeling in the model can decrease its accuracy.

In Figure 3, we show how the performance of pmtlm, pmtlm-dc, and phits-plsa varies as a function of $\alpha$, the relative weight of content vs. links. Recall that at $\alpha = 0$ these algorithms label documents solely on the basis of their links, while at $\alpha = 1$ they only pay attention to the content. Each point consists of the top 20 runs with that value of $\alpha$.

For Cora and Citeseer, there is an intermediate value of $\alpha$ at which pmtlm and pmtlm-dc have the best accuracy. However, this peak is fairly broad, showing that we do not have to tune $\alpha$ very carefully. For PubMed, where we also normalized the content information by document length, pmtlm-dc performs best at a particular value of $\alpha$.

We give the running time of these algorithms, including pmtlm and pmtlm-dc with and without the kl heuristic, in Table 1, and compare it to the running time of the other algorithms we implemented. Our EM algorithm is much faster than the variational EM algorithm for rtm, and is scalable in that it grows linearly with the size of the corpus.

4.4 Link Prediction

Link prediction (e.g. [8, 23, 34]) is a natural generalization task in networks, and another way to measure the quality of our model and our EM algorithm. Based on a training set consisting of a subset of the links, our goal is to rank all pairs without an observed link according to the probability of a link between them. For our models, we rank pairs according to the expected number of links in the Poisson distribution, (2) and (3), which is monotonic in the probability that at least one link exists.

We can then predict links between those pairs where this probability exceeds some threshold. Since we are agnostic about this threshold and about the cost of Type I vs. Type II errors, we follow other work in this area by defining the accuracy of our model as the AUC, i.e. the probability that a random true positive link is ranked above a random true non-link. Equivalently, this is the area under the receiver operating characteristic (ROC) curve. Our goal is to do better than the baseline AUC of $1/2$, corresponding to a random ranking of the pairs.
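The probabilistic definition of the AUC can be computed directly by comparing every held-out link with every non-link. The sketch below does this in plain Python; the function name `auc` and the convention of counting ties as one half are assumptions for the example, and the quadratic loop is meant for illustration rather than for large data sets.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive is scored above a random negative.

    Ties are counted as 1/2 (an assumed convention), so constant scores
    give exactly the random-ranking baseline of 1/2.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Two held-out links and three non-links: 5 of the 6 comparisons are won.
example = auc([0.9, 0.4], [0.5, 0.3, 0.1])
baseline = auc([1.0, 1.0], [1.0, 1.0, 1.0])
```

Here `example` equals 5/6 and `baseline` equals 1/2.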

We carried out 10-fold cross-validation, in which the links in the original graph are partitioned into 10 subsets of equal size. For each fold, we used one subset as the test links, and trained the model using the links in the other 9 subsets. We evaluated the AUC on the held-out links and the non-links. For Cora and Citeseer, all the non-links are used. For PubMed, we randomly chose 10% of the non-links for comparison. We trained the models with the same settings as those for document classification in Section 4.3; we executed independent runs for each test. Note that unlike the document classification task, here we used the full topic mixtures to predict links, not just the discrete labels consisting of the most-likely topic for each document.
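The link partition used in this protocol can be sketched as follows. This is a minimal illustration, not our experimental code: the function names `link_folds` and `train_test_splits` and the fixed random seed are assumptions, and when the number of links is not divisible by 10 the fold sizes differ by at most one.

```python
import random

def link_folds(links, k=10, seed=0):
    """Shuffle the links and deal them into k folds of nearly equal size."""
    shuffled = list(links)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def train_test_splits(links, k=10, seed=0):
    """Yield (train_links, test_links), holding out one fold at a time."""
    folds = link_folds(links, k, seed)
    for i, test in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test

# Toy graph: a path with 25 links.
links = [(i, i + 1) for i in range(25)]
splits = list(train_test_splits(links))
```

Each link appears in exactly one test set, and the model for a given fold would be trained only on the corresponding `train` list.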

Note that pmtlm-dc assigns $S_d$ to be zero if the degree of $d$ is zero. This makes it impossible for $d$ to have any test link with others if its observed degree is zero in the training data. One way to solve this is to assign a small positive value to $S_d$ even if $d$'s degree is zero. Our approach assigns $S_d$ to be the smallest value among those $S_{d'}$ that are non-zero:

$$S_d = \min \{ S_{d'} : S_{d'} \neq 0 \} .$$
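A minimal sketch of this floor, assuming the degree-correction parameters are held in a plain list; the function name `floor_zero_degree` is hypothetical, and the sketch does not renormalize the parameters afterwards.

```python
def floor_zero_degree(S):
    """Replace each zero entry of S by the smallest strictly positive entry."""
    positive = [s for s in S if s > 0]
    if not positive:
        raise ValueError("all degree-correction parameters are zero")
    s_min = min(positive)
    return [s if s > 0 else s_min for s in S]

# The second document has no training links, so it receives the minimum, 0.2.
S_fixed = floor_zero_degree([0.5, 0.0, 0.2, 0.3])
```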

Figure 3: The accuracy of PMTLM, PMTLM-DC, and PHITS-PLSA on the document classification task, measured by the NMI, as a function of $\alpha$, the relative weight of the content vs. the links. At $\alpha = 0$ these algorithms label documents solely on the basis of their links, while at $\alpha = 1$ they pay attention only to the content. For Cora and Citeseer, there is a broad range of $\alpha$ that maximizes the accuracy. For PubMed, the degree-corrected model PMTLM-DC performs best at a particular value of $\alpha$.
(a) AUC values for different $\alpha$.
(b) ROC curves achieving the highest AUC values.
(c) Precision-recall curves achieving the highest AUC values.
Figure 4: Performance on the link prediction task. For all three data sets and all the $\alpha$ values, the PMTLM-DC model achieves higher accuracy than the PMTLM model. In contrast to Figure 3, for this task the optimal value of $\alpha$ is relatively small, showing that the content is less important, and the topology is more important, for link prediction than for document classification. The green line in Figure 4(a) indicates the highest AUC achieved by the RTM model, maximized over its tunable parameter. Our models outperform RTM on all three data sets. In addition, the degree-corrected model (PMTLM-DC) does significantly better than the uncorrected version (PMTLM).

Figure 4(a) gives the AUC values for pmtlm and pmtlm-dc as a function of the relative weight $\alpha$ of content vs. links. The green horizontal line in each of those subplots represents the highest AUC value achieved by the rtm model for each data set, using the best value of its tunable parameter among those specified in Section 4.3. Interestingly, for Cora and Citeseer the optimal value of $\alpha$ is smaller than in Figure 3, showing that content is less important for link prediction than for document classification. We also plot the receiver operating characteristic (ROC) curves and precision-recall curves that achieve the highest AUC values in Figure 4(b) and Figure 4(c) respectively.

We see that, for all three data sets, our models outperform rtm, and that the degree-corrected model pmtlm-dc is significantly more accurate than the uncorrected one.

5 Conclusions

We have introduced a new generative model for document networks. It is a marriage between Probabilistic Latent Semantic Analysis [19] and the Ball-Karrer-Newman mixed membership block model [2]. Because of its mathematical simplicity, its parameters can be inferred with a particularly simple and scalable EM algorithm. Our experiments on both document classification and link prediction show that it achieves high accuracy and efficiency for a variety of data sets, outperforming a number of other methods. In future work, we plan to test its performance for other tasks including supervised and semisupervised learning, active learning, and content prediction, i.e., predicting the presence or absence of words in a document based on its links to other documents and/or a subset of its text.

6 Acknowledgments

We are grateful to Brian Ball, Brian Karrer, Mark Newman and David M. Blei for helpful conversations. Y.Z., X.Y., and C.M. are supported by AFOSR and DARPA under grant FA9550-12-1-0432.

References


  • [1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. J. Machine Learning Research, 9:1981–2014, 2008.
  • [2] B. Ball, B. Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Phys. Rev. E, 84:036103, 2011.
  • [3] S. Basu. Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments. PhD thesis, Department of Computer Sciences, University of Texas at Austin, 2005.
  • [4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Machine Learning Research, 3:993–1022, 2003.
  • [5] J. Chang and D. M. Blei. Relational topic models for document networks. Artificial Intelligence and Statistics, 2009.
  • [6] J. Chang and D. M. Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1):124–150, Mar. 2010.
  • [7] A. Chen, A. A. Amini, P. J. Bickel, and E. Levina. Fitting community models to large sparse networks. CoRR, abs/1207.2340, 2012.
  • [8] A. Clauset, C. Moore, and M. E. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
  • [9] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. 17th Intl. Conf. on Machine Learning, pages 167–174, 2000.
  • [10] D. Cohn and T. Hofmann. The missing link—a probabilistic model of document content and hypertext connectivity. Proc. 13th Neural Information Processing Systems, 2001.
  • [11] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84(6), 2011.
  • [12] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107:065701, 2011.
  • [13] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proc. National Academy of Sciences, 101 Suppl:5220–7, Apr. 2004.
  • [14] S. E. Fienberg and S. Wasserman. Categorical data analysis of single sociometric relations. Sociological Methodology, pages 156–192, 1981.
  • [15] L. Getoor and C. P. Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.
  • [16] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. Journal of Machine Learning Research, 3:679–707, December 2002.
  • [17] P. Gopalan, D. Mimno, S. Gerrish, M. Freedman, and D. Blei. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems 25, pages 2258–2266, 2012.
  • [18] A. Gruber, M. Rosen-Zvi, and Y. Weiss. Latent topic models for hypertext. Proc. 24th Conf. on Uncertainty in Artificial Intelligence, 2008.
  • [19] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM.
  • [20] P. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
  • [21] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, 2011.
  • [22] M. Kim and J. Leskovec. Latent multi-group membership graph model. CoRR, abs/1205.4546, 2012.
  • [23] L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.
  • [24] Q. Lu and L. Getoor. Link-based classification using labeled and unlabeled data. ICML Workshop on "The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining", 2003.
  • [25] M. Meilă. Comparing clusterings by the variation of information. Learning theory and kernel machines, pages 173–187, 2003.
  • [26] C. Moore, X. Yan, Y. Zhu, J. Rouquier, and T. Lane. Active learning for node classification in assortative and disassortative networks. In Proc. 17th KDD, pages 841–849, 2011.
  • [27] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for text and citations. Proc. 14th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’08, page 542, 2008.
  • [28] M. E. J. Newman and E. A. Leicht. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences of the United States of America, 104(23):9564–9, 2007.
  • [29] P. Sen, G. Namata, M. Bilgic, and L. Getoor. Collective classification in network data. AI Magazine, pages 1–24, 2008.
  • [30] T. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.
  • [31] C. Sun, B. Gao, Z. Cao, and H. Li. HTM: a topic model for hypertexts. In Proc. Conf. on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 514–522, 2008.
  • [32] T. Yang, R. Jin, Y. Chi, and S. Zhu. A Bayesian framework for community detection integrating content and link. In Proc. 25th Conf. on Uncertainty in Artificial Intelligence, pages 615–622, 2009.
  • [33] P. Yu, J. Han, and C. Faloutsos. Link Mining: Models, Algorithms, and Applications. Springer, 2010.
  • [34] Y. Zhao, E. Levina, and J. Zhu. Link prediction for partially observed networks. arXiv preprint arXiv:1301.7047, 2013.

Appendix A Update Equations for PMTLM

In this appendix, we derive the update equations (10)–(12) for the parameters $\eta$, $\beta$, and $\theta$, giving the M step of our algorithm.

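Recall that the likelihood is given by (7) and (3.3). For identifiability, we impose the normalization constraints

$$\sum_w \beta_{zw} = 1 \quad \text{for all } z, \tag{18}$$

$$\sum_z \theta_{dz} = 1 \quad \text{for all } d. \tag{19}$$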

For each topic $z$, taking the derivative of the likelihood with respect to $\eta_z$ gives




Plugging this in to (3.3) makes the last term a constant, . Thus we can ignore this term when estimating .

Similarly, for each topic $z$ and each word $w$, taking the derivative with respect to $\beta_{zw}$ gives


where the unknown constant is the Lagrange multiplier for (18). Normalizing $\beta_z$ determines this multiplier, and gives


Finally, for each document $d$ and each topic $z$, taking the derivative with respect to $\theta_{dz}$ gives


where the unknown constant is the Lagrange multiplier for (19). Normalizing $\theta_d$ determines this multiplier and gives


Appendix B Update Equations for the Degree-Corrected Model

Recall that in the degree-corrected model pmtlm-dc, the number of links between each pair of documents is Poisson-distributed with mean


To make the model identifiable, in addition to (18) and (19), we impose the following constraint on the degree-correction parameters,


With this constraint, we have


The update equation (23) for $\beta$ remains the same, since the degree-correction only affects the part of the model that generates the links, not the words. We now derive the update equations for $\eta$, $S$, and $\theta$.

For each topic $z$, taking the derivative of the likelihood with respect to $\eta_z$ gives


where we used (27). Thus


so $\eta_z$ is simply the expected number of links caused by topic $z$. In particular,


For $S_d$, we have


where is the Lagrange multiplier for (27). Thus


We will determine below. However, note that multiplying both sides of (32) by , summing over , and applying (27) and (31) gives


Most importantly, for $\theta_{dz}$ we have


where is the Lagrange multiplier for (19), and where we applied (27) in the second equality. Multiplying both sides of (35) by , summing over , and applying (33) gives


Summing over and applying (27), (30), and (36) gives


Thus measures how the inferred topic distributions of the words differ from the topic mixtures .

Finally, (35) and (36) give


where and are given by (30) and (B).

Appendix C Update Equations with Dirichlet Prior

If we impose a Dirichlet prior on , with parameters for each topic , this gives an additional term in the log-likelihood of both the pmtlm and pmtlm-dc models. This is equivalent to introducing pseudocounts for each , which we can think of as additional words or links that we know are due to topic . Our original models, without this term, correspond to the uniform prior with and . However, as long as so that the pseudocounts are nonnegative, we can infer the parameters of our model in the same way with no loss of efficiency.

In the pmtlm model, (25) becomes


In the degree-corrected model pmtlm-dc, (36) and (B) become




Note that has two contributions. One measures, as before, how the inferred topic distributions of the words differ from the topic mixtures , and the other measures how the fraction of pseudocounts for topic differs from .

Finally, (38) becomes


where and are given by (30) and (C).