Nonparametric Relational Topic Models through Dependent Gamma Processes

03/30/2015 · Junyu Xuan et al. · University of Technology Sydney; Shanghai University

Traditional Relational Topic Models provide a way to discover the hidden topics from a document network. Many theoretical and practical tasks, such as dimensionality reduction, document clustering and link prediction, benefit from this revealed knowledge. However, existing relational topic models are based on an assumption that the number of hidden topics is known in advance, which is impractical in many real-world applications. Therefore, in order to relax this assumption, we propose a nonparametric relational topic model in this paper. Instead of using fixed-dimensional probability distributions in its generative model, we use stochastic processes. Specifically, a gamma process is assigned to each document to represent the topic interest of that document. Although this method provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of a typical document network, i.e., two spatially closer documents tend to have more similar topics. Furthermore, we require that the topics are shared by all the documents. In order to resolve these challenges, we use a subsampling strategy to assign each document a different gamma process derived from a global gamma process, and the subsampling probabilities of documents are given a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and their number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the capabilities of learning the hidden topics and, more importantly, the number of topics.


I Introduction

Understanding a corpus is significant for businesses, organizations and individuals; for instance, the academic papers of IEEE, the emails in an organization and the previously browsed webpages of a person. One commonly accepted and successful way to understand a corpus is to discover the hidden topics in the corpus [1, 2]. The revealed hidden topics could improve the services of IEEE, such as the ability to search, browse or visualize academic papers; help an organization understand and resolve the concerns of its employees; and help internet browsers understand the interests of a person and then provide accurate personalized services. Furthermore, there are normally links between the documents in a corpus. A paper citation network [3] is an example of a document network in which the academic papers are linked by their citation relations; an email network [4] is a document network in which the emails are linked by their reply relations; a webpage network [5] is a document network in which webpages are linked by their hyperlinks. Since these links also express the nature of the documents, it is apparent that hidden topic discovery should consider these links as well.

Studies on discovering hidden topics from document networks using Relational Topic Models (RTMs) [6, 7, 8] have already been successfully developed. Unlike traditional topic models [1, 2], which focus on mining the hidden topics from a document corpus (without links between documents), RTMs make the discovered topics inherit the document network structure. The links between documents can be considered as constraints on the hidden topics.

One drawback of existing RTMs is that they are built with fixed-dimensional probability distributions, such as the Dirichlet, Multinomial, Gamma and Poisson distributions, which require their dimensions to be fixed before use. Hence, the number of hidden topics must be specified in advance, normally using domain knowledge. This is difficult and unrealistic in many real-world applications, so RTMs fail to find the number of topics in a document network.

In order to overcome this drawback, we propose a Nonparametric Relational Topic (NRT) model in this paper, which removes the necessity of fixing the topic number. Instead of probability distributions, stochastic processes are adopted by the proposed model. A stochastic process can be simply considered as an ‘infinite’ dimensional distribution. (We only consider pure-jump processes in this paper; some continuous processes cannot be simply considered as ‘infinite’ dimensional distributions.) In order to express the interest of a document in the ‘infinite’ number of topics, we assign each document a gamma process, which has infinitely many components. An additional requirement for the gamma process assignment is that two linked documents should tend to share similar topics. This is a common feature found in many real-world applications, and many studies [6, 7, 8] have exploited this property. In order to meet this requirement, we formally define two properties that any relational topic model of a document network should satisfy. First, we use a global gamma process to represent a set of base components that is shared by all documents. This is important because users are not interested in analyzing documents in a database that share no common topics [9]. Our model achieves the defined properties by: 1) thinning the global gamma process with document-dependent probabilities; 2) adding a Markov Random Field constraint to the thinning probabilities to retain the network structure. Finally, we assign each document a gamma process that inherits both the content of the document and the link structure. Two sampling algorithms are designed to learn the proposed model under different conditions. Experiments on document networks show the model's efficiency in learning the hidden topics and its superior ability to learn the number of hidden topics.
It is worth noting that, although we use document networks as examples throughout this paper, our work can be applied to other networks with node features.

The main contributions of this paper are to:

  1. propose a new nonparametric Bayesian model which can relax the topic number assumption used in the traditional relational topic models;

  2. design two sampling inference algorithms for the proposed model: a truncated version and an exact version.

The rest of this paper is structured as follows. Section II summarizes the related work. The proposed NRT model is presented in Section III, and the detailed derivations of its sampling inference are illustrated in Section IV. Section V presents experimental results on both synthetic and real-world data. Finally, Section VI concludes this study with a discussion of future directions.

II Related Work

In this section, we briefly review the work related to this paper. The first part summarizes the literature on relational topic models. The second part summarizes the literature on nonparametric Bayesian learning.

II-A Topic Models with Networks

Our work in this paper aims to model data with the network structure as a constraint. Since social networks and citation networks are two explicit and commonly used networks in the data mining and machine learning areas, several extensions of traditional topic models have been adapted to these networks. For social networks, an Author-Recipient-Topic model [10] was proposed to analyze the categories of roles in social networks based on the relationships of people in the network. A similar task was investigated in [11], where the social network structure was inferred from informal chat-room conversations utilizing a topic model [12]. The ‘noisy links’ and ‘popularity bias’ of social networks were addressed by properly designed topic models in [13] and [14]. As an important issue in social network analysis, communities [15] were extracted using a Social Topic Model [16].

The Mixed Membership Stochastic Blockmodel is another way to learn the mixed membership vector (i.e., topic distribution) for each node from a network structure [17], but it does not consider the content/features of each node. For citation networks, the Relational Topic Model (RTM) was proposed to infer topics [6], discriminative topics [8] and hierarchical topics [7] from citation networks by introducing a link variable between two linked documents. Unlike RTM, a block model was adopted to model the link between two documents [18, 19]. Considering the physical meaning of citation relations, a variable was introduced to indicate whether the content of a citing paper is inherited from the cited paper [20, 21]. In order to preserve the document network structure, a Markov Random Field (MRF) was combined with a topic model [22]. Communities in citation networks have also been investigated [23].

In summary, existing relational topic models are all inherited from traditional topic models, so the number of topics needs to be fixed. It is unrealistic, in many real-world situations, to fix this number in advance. Our work tries to resolve this issue through the nonparametric learning techniques reviewed in the following subsection.

II-B Relational Topic Model

Fig. 1: Finite Relational-Topic-Model

Since the finite relational topic model serves as our baseline for comparison, we introduce a relational topic model [6] here in detail. The corresponding graphical representation is shown in Fig. 1, and the generative process is as follows,

(1) θ_d ∼ Dirichlet(α), φ_k ∼ Dirichlet(η), z_{d,n} ∼ Multinomial(θ_d), w_{d,n} ∼ Multinomial(φ_{z_{d,n}})

where θ_d is the topic distribution of document d, φ_k is the word distribution of topic k, z_{d,n} is the topic index of word n in document d, and w_{d,n} is the observed word n in document d. All these variables are the same as in the original LDA [1]. The different and significant part is the variable y_{d,d'}, which denotes the observed link between documents d and d'. This model uses a Generalized Linear Model [24] to model the generation of the document links,

(2) p(y_{d,d'} = 1 | z_d, z_{d'}) = ψ(η^T (z̄_d ∘ z̄_{d'}) + ν), where z̄_d is the empirical topic proportion of document d and ∘ denotes the element-wise product.

One problem with this model is that the number of topics needs to be pre-defined, and for some real-world applications this is not trivial.
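As a toy illustration, the RTM generative story above can be sketched with a forward simulation. This is a minimal sketch, not the authors' implementation: the Dirichlet hyperparameters, the scalar coefficient `eta` standing in for RTM's regression vector, and the sigmoid link function are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtm_generate(K, V, D, n_words, alpha=0.5, beta=0.1, eta=5.0, seed=0):
    """Toy forward simulation of the RTM generative story.

    theta_d ~ Dirichlet(alpha) are document topic distributions and
    phi_k ~ Dirichlet(beta) are topic word distributions; each word draws a
    topic z and then a word w. Links are Bernoulli with probability given by
    a GLM on the element-wise product of empirical topic proportions (the
    scalar eta stands in for RTM's regression coefficients).
    """
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)     # K x V word distributions
    theta = rng.dirichlet(np.full(K, alpha), size=D)  # D x K topic distributions
    zbar = np.zeros((D, K))
    docs = []
    for d in range(D):
        z = rng.choice(K, size=n_words, p=theta[d])
        docs.append([int(rng.choice(V, p=phi[k])) for k in z])
        zbar[d] = np.bincount(z, minlength=K) / n_words  # empirical proportions
    link_prob = sigmoid(eta * (zbar @ zbar.T))  # GLM link model: p(y_{d,d'} = 1)
    return docs, link_prob

docs, link_prob = rtm_generate(K=3, V=20, D=4, n_words=50)
```

Note how the link probability depends only on the topic assignments of the two documents, which is what lets the links constrain the learned topics.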

II-C Nonparametric Bayesian Learning

Nonparametric Bayesian learning [25] is a key approach for learning the number of mixtures in a mixture model (also called the model selection problem). Without predefining the number of mixtures, this number is supposed to be inferred from the data, i.e., let the data speak.

The traditional elements of probabilistic models are fixed-dimensional distributions, such as the Gaussian distribution, the Dirichlet distribution [1], the Logistic Normal distribution [26], and so on. All these distributions need their dimensions predefined. In order to avoid this, the Gaussian process [27] and the Dirichlet process [28] are used to replace these fixed-dimensional distributions because of their infinite properties. Since the data is limited, the number of learned/used atoms will also be limited even with these ‘infinite’ stochastic processes.

The Dirichlet process can be seen as a distribution over distributions. Since a sample from a Dirichlet process defines a set of variables that satisfies a Dirichlet distribution, the Dirichlet process is a good alternative prior for models that use the Dirichlet distribution. There are three different methods to construct this process: the Blackwell-MacQueen urn scheme [29], the Chinese restaurant process [30] and the stick-breaking process [30]. Although the processes that result from them are all Dirichlet processes, they express different properties of the Dirichlet process: the posterior distribution from the Blackwell-MacQueen urn scheme, the clustering from the Chinese restaurant process, and the formal sampling function from the stick-breaking process. Based on these constructive processes, the Dirichlet process mixture [31] was proposed, which is a kind of infinite mixture model. Infinite mixture models are the extension of finite mixture models, in which a finite number of hidden components (topics) is used to generate data. Another infinite mixture model is the infinite Gaussian mixture model [32]. Normally, a Gaussian mixture model is used for continuous variables and a Dirichlet process mixture is used for discrete variables. An example use of a Dirichlet process is the hierarchical topic model composed of Latent Dirichlet Allocation (LDA) [1] with a nested Chinese restaurant process [33]. By using a nested Chinese restaurant process as the prior, not only is the number of topics not fixed, but the topics are also hierarchically organized. In order to learn Dirichlet process mixture based models with the infinite property, the inference methods must be properly designed. There are two popular and successful approaches: Markov Chain Monte Carlo (MCMC) [34] and variational inference [35].
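The clustering behavior of the Chinese restaurant process, in which the number of occupied tables (components) grows with the data rather than being fixed in advance, can be sketched as follows. This is a generic illustration of the CRP, not code from the paper:

```python
import numpy as np

def chinese_restaurant_process(n, alpha, rng=None):
    """Sample table assignments for n customers under CRP(alpha).

    Customer i sits at an existing table t with probability n_t / (i + alpha)
    or opens a new table with probability alpha / (i + alpha); the number of
    occupied tables (mixture components) is random rather than fixed.
    """
    rng = np.random.default_rng(rng)
    tables = []        # tables[t] = number of customers at table t
    assignments = []
    for _ in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()
        t = int(rng.choice(len(probs), p=probs))
        if t == len(tables):
            tables.append(1)   # open a new table (new component)
        else:
            tables[t] += 1
        assignments.append(t)
    return assignments, len(tables)

assignments, n_tables = chinese_restaurant_process(500, alpha=2.0, rng=0)
```

Larger `alpha` tends to produce more tables, which is how the concentration parameter governs model complexity in "let the data speak" fashion.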

To summarize, nonparametric learning has been successfully used for extending many models and applied in many real-world applications. However, there is still no work on the nonparametric extension of relational topic models. This paper uses a set of Gamma processes to extend the finite relational topic model to the infinite one.

III Nonparametric Relational Topic Model

In this section, we present the proposed Nonparametric Relational Topic (NRT) model in detail. This model removes the need to predefine the number of topics.

The proposed model uses a gamma process to express the interest of a document in infinitely many hidden topics. A gamma process G ∼ ΓP(c, G_0) [36] is a stochastic process, where G_0 is a base (shape) measure parameter and c is the concentration (scale) parameter. It also corresponds to a completely random measure. Let G be a random realization of a gamma process in the product space Θ × R^+. Then, we have

(3) G = Σ_{k=1}^∞ π_k δ_{θ_k}

where δ_{θ_k} is an indicator function at atom θ_k, and the weights {π_k} satisfy an improper gamma distribution governed by c and G_0. When using G to express the document interest, the atoms {θ_k} in Eq. (3) denote the infinite number of topics, and the weights {π_k} in Eq. (3) denote the weights of the infinite number of topics in a document.
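A draw from a gamma process can be approximated numerically with a finite truncation that splits the total base mass across K atoms. The sketch below is an illustrative assumption (a naive finite approximation with a uniform toy base measure), not the construction used in the paper:

```python
import numpy as np

def sample_gamma_process(K=1000, alpha0=1.0, c=1.0, rng=None):
    """Finite-K approximation to a gamma process draw with total base mass alpha0.

    Each of the K atoms gets an independent Gamma(alpha0/K, scale=1/c) weight;
    as K grows this approaches a gamma process realization, whose total mass
    G(Theta) is Gamma(alpha0, scale=1/c) distributed (mean alpha0/c).
    """
    rng = np.random.default_rng(rng)
    weights = rng.gamma(shape=alpha0 / K, scale=1.0 / c, size=K)
    atoms = rng.uniform(size=K)  # toy base measure: uniform on [0, 1]
    return atoms, weights

atoms, weights = sample_gamma_process(K=2000, alpha0=2.0, c=1.0, rng=0)
total_mass = weights.sum()  # a single Gamma(2, 1)-distributed draw
```

In the document-modeling reading, each atom plays the role of a topic and its weight the document's interest in that topic; most weights are near zero, so only a few topics dominate.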

Fig. 2: Illustration of gamma process assignments for the document network. Each document is assigned a gamma process, which has infinite components (represented by the fences in a document in the figure). Each fence denotes a hidden topic, and some examples are shown in the figure. The lengths of the fences denote the weights of different topics in a document.

As illustrated in Fig. 2, our idea is to assign each document a gamma process. This assignment should satisfy the following two properties:

Property 1.

The gamma processes of two linked documents should have similar components/topics.

Property 2.

All the Gamma processes of documents should share the same set of components/topics.

In order to achieve the above properties, we first generate a global gamma process G^0 ∼ ΓP(c, G_0), which is equal to G^0 = Σ_{k=1}^∞ π_k δ_{θ_k}, where {θ_k} is the shared global set of components/topics for all documents. We then require that the components of the gamma process for each document fall within the set of components/topics of the global gamma process. We use a dependent thinned gamma process to achieve this goal. Its definition is as follows,

Definition 1 (Thinned Gamma Process [37]).

Suppose we have a gamma process G^0 with countably infinite points {(π_k, θ_k)}. Then, we generate a set of independent binary variables {b_k} (b_k ∈ {0, 1}). The new process,

(4) G = Σ_{k=1}^∞ b_k π_k δ_{θ_k}

is still a gamma process, as proved in [37]. The b_k can be seen as indicators for the retention of the points of the original/global gamma process, so G is called a thinned gamma process.

We can give each b_k a Bernoulli prior. Different realizations of {b_k} will lead to different gamma processes, and dependence between different realizations of {b_k} will likewise induce dependence between the generated gamma processes.
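In code, thinning reduces to masking the atoms of a (truncated) global draw with Bernoulli indicators. A minimal sketch, assuming a finite truncation of the global draw and illustrative keep-probabilities:

```python
import numpy as np

def thin_gamma_process(weights, keep_prob, rng=None):
    """Thin a (truncated) gamma process draw: keep atom k with probability keep_prob[k].

    Returns binary indicators b_k and the thinned weights b_k * pi_k; the
    surviving atoms form a document-specific gamma process whose topics are
    a subset of the global ones.
    """
    rng = np.random.default_rng(rng)
    b = (rng.random(len(weights)) < np.asarray(keep_prob)).astype(int)
    return b, np.where(b == 1, weights, 0.0)

rng = np.random.default_rng(1)
global_weights = rng.gamma(shape=0.01, scale=1.0, size=200)  # truncated global draw
b, doc_weights = thin_gamma_process(global_weights, np.full(200, 0.3), rng=1)
```

Because every document's process is produced by masking the same global atoms, all documents automatically share one topic vocabulary, which is the point of the construction.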

For each document d, a thinned gamma process is generated with G^0 as the global process,

(5) G_d = Σ_{k=1}^∞ b_{d,k} π_k δ_{θ_k}

where {b_{d,k}} is the set of indicators of document d on the corresponding components. These b_{d,k} are independent and identically distributed random variables with Bernoulli distributions,

(6) b_{d,k} ∼ Bernoulli(q_{d,k})

where q_{d,k} denotes the probability that the gamma process of document d retains component k. Since every G_d draws its components from the global process, Property 2 is achieved.

In order to make linked documents have similar gamma processes, we define a subsampling Markov Random Field (MRF) [38, 39] to constrain the q_{d,k} of all documents,

Definition 2 (Subsampling Markov Random Field).

The subsampling probabilities of all the documents on a component/topic k in the global gamma process have the following constraint,

(7) P(q_{1,k}, …, q_{D,k}) = (1/Z) exp(−E(q_{1,k}, …, q_{D,k}; network))

where 'network' is the document network, E(·) is the energy function of the MRF, and Z is the normalization term, also called the partition function.

Through this subsampling MRF constraint, the marginal distribution of each subsampling probability depends on the values of its neighbors. Therefore, the q_{d,k} of linked documents will be similar, which ensures that the proposed NRT achieves Property 1.
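A toy pairwise energy function makes this effect concrete: configurations in which linked documents agree get lower energy and hence higher probability under the MRF. The quadratic penalty and the coupling strength `lam` below are illustrative assumptions, not the paper's energy function:

```python
import numpy as np

def mrf_energy(q, edges, lam=1.0):
    """Pairwise MRF energy on subsampling probabilities q (one per document).

    Linked documents are penalized for disagreeing, so low-energy (high
    probability) configurations give neighbors similar keep-probabilities.
    lam is a hypothetical coupling strength, not from the paper.
    """
    q = np.asarray(q, dtype=float)
    return lam * sum((q[d] - q[e]) ** 2 for d, e in edges)

edges = [(0, 1), (1, 2)]  # a 3-document chain network
smooth = mrf_energy([0.5, 0.5, 0.5], edges)  # neighbors agree
rough = mrf_energy([0.9, 0.1, 0.9], edges)   # neighbors disagree
```

The smooth configuration has strictly lower energy than the rough one, so under exp(−E)/Z it is strictly more probable, which is exactly the behavior Property 1 asks for.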

Fig. 3: Nonparametric Relational Topic (NRT) Model by dependent thinned Gamma Processes and Markov Random Field (MRF)

To sum up, the proposed Nonparametric Relational Topic (NRT) model combines the dependent thinned gamma processes with the subsampling MRF. With the q_{d,k} for all the documents in hand, the generative procedure of the documents is as follows. Considering the relationship between the Poisson distribution and the multinomial distribution, the likelihood part can be written in an equivalent form that is more convenient for the slice sampling of the model. Note that the q_{d,k}'s carry not only a beta distribution prior but also the MRF constraint at the same time; we use this constraint to make the learned q_{d,k}'s satisfy the desired property.

IV Model Inference

The inference of the proposed NRT model is to compute the posterior distribution of the latent variables given the data. Here, we use a Gibbs sampling method to draw samples from this posterior distribution with a truncation (a relatively large predefined maximum topic number). We also adopt the slice sampling technique [40] to develop an exact sampler without the truncation.

IV-A Gibbs Sampling

It is difficult to perform posterior inference under infinite mixtures, and a common workaround in nonparametric Bayesian learning is to use a truncation method. This widely accepted method uses a relatively large K as the (potential) maximum number of topics.

Sampling q_{d,k}. Since there are additional constraints on the variables q_{d,k}, they do not have a closed-form posterior distribution.

If b_{d,k} = 1,

(8)

If b_{d,k} = 0,

(9)

Given this conditional distribution of q_{d,k}, we can use the recently developed and efficient A* sampling [41], because the conditional distribution can be decomposed into two parts: a beta-distribution part and a bounded-function part. The first part is easily sampled and serves as the proposal distribution, and the second part is a bounded function.

Sampling

    1. if ,

      (10)
    2. if ,

      (11)
    3. if ,

      (12)

Finally, we can use a discrete distribution to sample by,

(13)

Sampling

(14)

where

Sampling

(15)

where

Sampling (truncated version)

(16)

where

(17)

Sampling (truncated version)

(18)

where and .

The whole sampling algorithm is summarized in Algorithm 1. Note that, given the other variables, the variables of different documents are independent of each other, so their updates can be implemented in a parallel fashion.

0:  , a document network with content
0:  , ,
1:  randomly set initial values for , ,
2:   = 1;
3:  while   do
4:     for  each topic  do
5:        for  each document  do
6:           for  each word of document  do
7:              Update by Eq. (16) ;
8:           end for
9:           Update by Eq. (8) or (9) ;
10:           Update by Eq. (13) ;
11:           Update by Eq. (14) ;
12:        end for
13:        Update by Eq. (15) ;
14:        Update by Eq. (18);
15:     end for
16:  end while
Algorithm 1 Truncated Version of Gibbs Sampling for NRT

IV-B Slice Sampling

Although the truncation method is commonly accepted in the literature, maintaining a large number of components and their parameters is time- and space-consuming. An elegant idea to resolve this problem, called slice sampling [40], is to introduce auxiliary variables that adaptively truncate/select the infinite components.

Sampling (slice sampling version) In order to do slice sampling, sample slice variable as,

(19)

where a fixed positive decreasing sequence is used, and

(20)

where

Sampling (slice sampling version) The construction of Gamma process () is,

(21)

where

(22)

The prior of is,

(23)

and the posterior is,

(24)

We can sample from this posterior using two gamma distributions,

(25)

where and . The conditional distribution for the indicator is,

(26)

The second factor is,

(27)

where the count variable is the number of items in the corresponding Poisson process.

0:  , a document network with content
0:  , ,
1:  randomly set initial values for , ,
2:   = 1;
3:  while   do
4:     for  each topic  do
5:        for  each document  do
6:           for  each word of document  do
7:              Sample slice variable by Eq. (19) ;
8:              Update by Eq. (20) ;
9:           end for
10:           Update by Eq. (8) or (9) ;
11:           Update by Eq. (13) ;
12:           Update by Eq. (14) ;
13:        end for
14:        Update by Eq. (15) ;
15:        Update by Eq. (25);
16:        Update by Eq. (26);
17:     end for
18:  end while
Algorithm 2 Slice Version of Gibbs Sampling for NRT

Note that these are introduced auxiliary variables: they are not in the original model, and they appear only to enable sampling without the help of a truncation level. The whole slice sampling algorithm is summarized in Algorithm 2.

V Experiments

In this section, we evaluate the effectiveness of the proposed model in learning the hidden topics from document networks. First, we use a small synthetic dataset to demonstrate the model’s ability to recover the number of available topics in the dataset. We then show its usefulness using real-world datasets.

V-A Experiments on synthetic data

We generated synthetic data to explore the NRT's ability to infer the number of hidden topics from a document network. We chose a set of ground-truth numbers K, D and V that refer to the number of topics, documents and keywords, respectively. Then, we generate the global topics from a V-dimensional Dirichlet distribution. Next, we generate the document interests on these topics from a K-dimensional Dirichlet distribution. Now that we have the topics and the document interests on these topics, we can generate each document as follows. For each document, the number of words is chosen within a fixed range.

For each word of a document, we first draw a topic from the document's interest and then draw a word from the selected topic. Finally, we obtain a matrix with rows as documents and columns as words, where each entry denotes the frequency of a particular word in a particular document. The next step is to generate the relations between documents. For each pair of documents, we compute the inner product between their topic distributions. In order to sparsify these relationships, we only retain the pairs whose inner product is greater than 0.2.
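This generation recipe can be sketched directly. The sizes, Dirichlet hyperparameters and fixed word count below are illustrative assumptions; only the overall recipe (Dirichlet topics and interests, topic-then-word sampling, inner-product links with a 0.2 threshold) follows the text:

```python
import numpy as np

def make_synthetic_network(K=5, D=30, V=50, n_words=100, thresh=0.2, seed=0):
    """Generate a toy document network following the synthetic-data recipe.

    Topics are Dirichlet draws over V words and document interests are
    Dirichlet draws over K topics; each word picks a topic and then a word.
    Documents are linked when the inner product of their interests > thresh.
    """
    rng = np.random.default_rng(seed)
    topics = rng.dirichlet(np.ones(V), size=K)     # K x V word distributions
    interests = rng.dirichlet(np.ones(K), size=D)  # D x K topic interests
    counts = np.zeros((D, V), dtype=int)           # document-word frequency matrix
    for d in range(D):
        for k in rng.choice(K, size=n_words, p=interests[d]):
            counts[d, rng.choice(V, p=topics[k])] += 1
    sim = interests @ interests.T
    links = [(i, j) for i in range(D) for j in range(i + 1, D) if sim[i, j] > thresh]
    return counts, links, interests

counts, links, interests = make_synthetic_network()
```

Thresholding the inner product sparsifies the network while still tying link structure to topic similarity, which is the property the model is meant to recover.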

Fig. 4: The results of NRT (slice version and truncated version) on synthetic data. The left sub-figures denote the distribution of active topic number from slice version; the right sub-figures denote the distribution of the active topic number from the truncated version. In each sub-figure, the ground-truth of the topic number is given at the top, and the bars represent the frequencies of each possible active topic number.

Here, we adjust the values of K, D and V to generate a set of synthetic datasets. The distributions of the topic numbers learned by the proposed algorithms are shown in Fig. 4. The subfigures in the first column are from the truncated version of NRT in Algorithm 1, and the subfigures in the second column are from the slice version of NRT in Algorithm 2. In each subfigure, the counts of topic numbers over all iterations (the maximum iteration number is set to 1,000 with 100 burn-in) are illustrated as bar charts. Despite the rough initial guess, we can see that the recovered histogram of the topic number is centered near the ground-truth value with small variance. The sampled topic number, plotted across all Gibbs iterations, shows that the Markov chain begins to mix well after around 400 samples.

Fig. 5: The change of active topic number during the iteration.

V-B Experiments on real-world data

The real-world datasets used here are:

  • Cora Dataset (http://linqs.cs.umd.edu/projects/projects/lbc/): The Cora dataset consists of 2708 scientific publications. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

  • Citeseer Dataset The CiteSeer dataset consists of 3312 scientific publications. The citation network consists of 4732 links. Each publication in the dataset is also described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words [42].

For each dataset, we use 5-fold cross validation to evaluate the performance of the proposed model compared with the Relational Topic Model. The whole dataset is equally split into five parts. At each stage, the documents in one part are chosen for testing while the remaining four parts are used for training. We used the implementation of RTM in Eq. (1) from A Fast And Scalable Topic-Modeling Toolbox (http://www.ics.uci.edu/~asuncion/software/fast.htm#rtm) for comparison.

  Datasets  # of documents  # of links  # of words
Cora 2,708 5,429 1,433
Citeseer 3,312 4,732 3,703
TABLE I: Statistics of Datasets

In order to quantitatively compare the proposed model with RTM, two evaluation metrics are designed for both real-world datasets: link prediction and document prediction. The link prediction is used to predict the links between test and training documents using learned topics.

The basic idea is that there will be a link between two documents if they have similar interests on topics. The evaluation equation is,

(28)

where the first count is the number of test documents, the second is the number of training documents, N_d is the number of words in document d, and the link indicator is 1 if there is a link between the two documents and 0 otherwise. θ_d denotes the learned topic distribution of a training document d. The topic distribution (a K-dimensional vector) of a word w can be evaluated by,

(29)

where the learned topics are used. The word's topic vector expresses the interest of the word in topics, and θ_d expresses the interest of training document d in topics, so their inner product is used to evaluate the probability of their link. We do not consider normalization here since it does not influence the comparison between two models on the same dataset, i.e., a “max” operator is applied.
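A minimal sketch of this scoring rule, assuming per-word topic vectors are available (the random inputs below merely stand in for learned quantities):

```python
import numpy as np

def link_scores(word_topic, test_doc_words, train_interests):
    """Score candidate links from a test document to each training document.

    word_topic[v] is a K-dim topic vector for word v; the test document is
    summarized by averaging its words' vectors, and the unnormalized score
    against training document d is the inner product with that document's
    learned topic interest (the max-scoring document is the predicted link).
    """
    test_vec = np.mean([word_topic[v] for v in test_doc_words], axis=0)
    return train_interests @ test_vec  # one score per training document

K, V, D = 4, 10, 3
rng = np.random.default_rng(0)
word_topic = rng.dirichlet(np.ones(K), size=V)       # stand-in for Eq. (29)
train_interests = rng.dirichlet(np.ones(K), size=D)  # stand-in learned interests
scores = link_scores(word_topic, [0, 3, 7], train_interests)
best = int(np.argmax(scores))  # the "max" operator mentioned in the text
```

Skipping the normalizer, as the text notes, leaves the argmax unchanged, so the predicted link is the same either way.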

For word prediction, the basic idea is that a test document has a similar interest in topics to its linked training documents, and its words are generated according to this interest. The evaluation equation is,

(30)

where the normalizer is the number of neighbors that the document has.

The results on the Cora dataset (5-fold) are shown in Figs. 6, 7, 8, 9, and 10, and the results on the Citeseer dataset (5-fold) are shown in Figs. 11, 12, 13, 14, and 15, in which we compare NRT with RTM under several settings. For clarity, we denote RTM with a given topic number as “RTMnum”; for example, RTM20 means RTM with 20 topics.

Note that the slice version of NRT in Algorithm 2 is used as the implementation of NRT. The reason is that the slice version is more efficient than the truncated version: it does not need to keep a (relatively) large number of hidden topics in memory (the initial guess for the number of topics is normally set larger than the number of documents).

We notice that our algorithm mixes better than some of the settings of RTM and is generally comparable with the rest. As shown in the left subfigures of each group, the likelihood of the NRT model is generally larger than that of RTM under various settings, which means that the proposed model fits or explains the data better than RTM. As in the synthetic case, we also plot the distribution of the learned topic number. We compared our method with RTM in terms of link and word prediction. In terms of word prediction, our algorithm consistently outperforms RTM in every category. In terms of link prediction, NRT's performance is not universally better than RTM's; we noticed some less accurate results under some RTM settings. We can see that there is a trend in link prediction with respect to the topic number in RTM. This trend comes from the evaluation equation (28): RTM with a smaller topic number tends to assign a higher probability to observed links, which has also been observed in [8]. In an extreme situation, RTM reaches its best performance on link prediction. The problem is how to choose the hidden topic number for RTM. Take the Cora dataset as an example: the candidate topic numbers span a large range. However, for the proposed NRT model, the active topic number is automatically learned from the data. Without any prior domain knowledge, this learned topic number achieves relatively good results on link prediction considering the large range of candidates. In terms of the overall results, we argue that in the absence of accurate domain knowledge of the topic number, the NRT algorithm achieves better and more robust performance compared with the current state-of-the-art methods.

Fig. 6: Results of NRT and RTM under different settings on the first fold of the Cora dataset (5-fold). The first subfigure shows the log-likelihood along iterations; the second subfigure is the learned distribution of the active topic number; the third subfigure compares NRT and RTM on the link prediction task; the fourth subfigure compares NRT and RTM on the word prediction task.

Fig. 7: Results of NRT and RTM under different settings on the second fold of the Cora dataset (5-fold).

Fig. 8: Results of NRT and RTM under different settings on the third fold of the Cora dataset (5-fold).

Fig. 9: Results of NRT and RTM under different settings on the fourth fold of the Cora dataset (5-fold).

Fig. 10: Results of NRT and RTM under different settings on the fifth fold of the Cora dataset (5-fold).

Fig. 11: Results of NRT and RTM under different settings on the first fold of the Citeseer dataset (5-fold).

Fig. 12: Results of NRT and RTM under different settings on the second fold of the Citeseer dataset (5-fold).

Fig. 13: Results of NRT and RTM under different settings on the third fold of the Citeseer dataset (5-fold).

Fig. 14: Results of NRT and RTM under different settings on the fourth fold of the Citeseer dataset (5-fold).

Fig. 15: Results of NRT and RTM under different settings on the fifth fold of the Citeseer dataset (5-fold).

VI Conclusions and Future Study

Despite the success of existing relational topic models in discovering hidden topics from document networks, they are based on an assumption, unrealistic in many real-world applications, that the number of topics can be easily predefined. In order to relax this assumption, we have presented a nonparametric relational topic model. In our proposed model, stochastic processes replace the fixed-dimensional probability distributions used by existing relational topic models, which are what force the number of topics to be predefined. At the same time, introducing stochastic processes makes model inference more difficult, so we have also presented truncated Gibbs and slice sampling algorithms for the proposed model. Experiments on both synthetic and real-world datasets have demonstrated our method's ability to infer the hidden topics and their number.

In the future, we are interested in making the sampling algorithm scalable to large networks by using new network constraint methods instead of MRFs, since the current MRF-based constraint does not make inference efficient enough. We believe that new network constraint methods can avoid this issue.

Acknowledgments

The research reported in this paper was partly supported by the Australian Research Council (ARC) under Discovery grant DP140101366 and by the China Scholarship Council. This work was also jointly supported by the National Natural Science Foundation of China under grant no. 61471232.
