1 Introduction
Traditional text mining algorithms model a text corpus with only two levels: document-word. Topic models are commonly regarded as efficient tools for text mining because they learn the hidden topics of a corpus [15]. Recently, attention has turned to the side information of a text corpus, which includes the conferences of the papers [21], time stamps [24], authors [18, 20], entities [9], emotion tags [1] and other labels [28]. Incorporating this side information into classical topic models benefits many real-world tasks. Among these extensions, the Author Topic Model (ATM) [18, 20, 17] adds a set of variables to the original topic model in order to represent and infer the interests of authors together with the hidden topics.
The ability to jointly learn the hidden topics and the authors’ interests on these topics has a variety of application scenarios. For example, 1) an academic recommendation system can recommend authors and/or papers whose research interests are similar to those of the input author; 2) the most and least surprising papers for an author can be detected [20]; 3) in an author-topic-based paper browser, a set of papers can be ranked according to authors and topics; 4) author disambiguation [26].
One drawback of the existing author topic model is that the number of hidden topics needs to be fixed in advance. This number is normally chosen with domain knowledge. By fixing the number of topics, ATM can then adopt Dirichlet and Multinomial distributions with a predefined dimension. However, limiting every document to exactly a fixed number of topics is unrealistic for many real-world applications. In this paper, we propose an infinite author topic (IAT) model to relax this assumption. Instead of fixed-dimensional distributions, stochastic processes are used: specifically, the gamma-negative binomial process [27] is extended to three levels to capture the hierarchical structure author-document-keyword. In this model, each document is assigned a gamma process, rather than a vector of fixed dimension, to express the interest of this document on the hidden topics. This gamma process can be regarded simply as an infinite discrete distribution, and is parameterized by a base measure (another gamma process) that denotes the interest of the author of this document on the hidden topics. However, a document normally has multiple authors, so we assign each document a mixed gamma process that is based on the gamma processes of all authors of this document. Introducing the mixed gamma process complicates model inference, so an efficient Gibbs sampler with closed-form conditional distributions is developed for the proposed model. Experiments on two real-world datasets show the capability of our model to learn the hidden topics and the number of topics simultaneously.
The main contributions of this paper are:

1) we propose a new nonparametric Bayesian model that relaxes the fixed-topic-number assumption of traditional author topic models;

2) we design an efficient Gibbs sampling inference algorithm for the proposed model.
The rest of the paper is structured as follows. Section 2 briefly introduces the related work. Section 3 describes some preliminary knowledge. The IAT model is presented in Section 4 together with its Gibbs sampling inference algorithm. Section 5 reports experimental results of the IAT model on real-world datasets. Finally, Section 6 concludes this study with a discussion of future directions.
2 Related Work
In this section, we briefly review the work related to this study. The first part covers topic models, and the second part covers nonparametric Bayesian learning.
2.1 Topic Models
Topic models [2] are Bayesian models with fixed-dimensional probability distributions. They were originally designed for text mining, where they aim to discover the hidden topics in a text corpus to assist document clustering or classification. Owing to their extensibility and representational power, they have been successfully applied in many research areas, including the analysis of images [12], video [10], genetics [5] and music [13]. Among these extensions, author topic models [18, 20, 17] were proposed to infer the hidden topics and author interests. Each document is assumed to be generated by its authors according to their interests over the hidden topics. This model is explained in more detail in Section 3. ATM has attracted considerable attention from researchers in text mining because it provides an elegant way to incorporate side information (in this case, the authors) of the documents into topic learning. The model can be extended to incorporate other side information of a text corpus, such as emotion tags [1], conferences [21] and time stamps [24].
2.2 Nonparametric Bayesian Learning
Nonparametric Bayesian learning is a key approach for learning the number of mixtures in a mixture model (also called the model selection problem). Rather than being predefined, this number is inferred from the data, i.e., let the data speak.
The idea of nonparametric Bayesian learning is to replace traditional fixed-dimensional probability distributions, such as the Multinomial, Poisson and Dirichlet distributions, with stochastic processes. To avoid the limitation of fixed dimensions, the Multinomial Process (MP), Poisson Process (PP) [8] and Dirichlet Process (DP) [22] are used in place of these distributions because of their infinite nature.
The merit of these stochastic processes is that they let the data determine the number of factors (in text mining, topics). DP is a good alternative for models that use a Dirichlet distribution as the prior. Many fixed-dimensional probabilistic models have been extended to infinite ones with the help of stochastic processes: the Gaussian Mixture Model (GMM) is extended to the Infinite Gaussian Mixture Model (IGMM) [16] using DP, and the Hidden Markov Model is extended to an infinite number of hidden states using the Hierarchical Dirichlet Process [23, 7]. Through posterior inference (e.g., Markov chain Monte Carlo (MCMC) [11]), the number of mixtures can be inferred. Other popular processes, including the beta process, gamma process, Poisson process, multinomial process and negative binomial process (NBP) [27, 3], have also been used in the machine learning community recently.
To summarize, nonparametric Bayesian learning [4] has been successfully used to extend many finite models and has been applied to many real-world applications. However, to the best of our knowledge, no work has yet used NBP for author topic modelling. This paper addresses this gap by proposing a mixed gamma-negative binomial process that extends the finite author topic model to an infinite one.
3 Preliminary Knowledge
This section briefly introduces the related models that are used in the rest of the paper.
3.1 Author Topic Model
The Author Topic Model [18, 20, 17] aims to learn the hidden topics from the papers and, more importantly, the authors’ interests on these topics. Based on the classical LDA [2], a set of new variables is introduced to indicate the authors’ interests. The graphical representation of the model is shown in Fig. 1, and the generative procedure is as follows,
(1)  
Here, one set of variables denotes the authors’ interests on the topics and another denotes the authors of a document. Eq. (1) shows that the ATM is built from fixed-dimensional probability distributions. One issue of this model is that the number of topics must be predefined, because the dimensions of these distributions must be predefined. However, predefining the topic number is difficult and often inappropriate in real-world scenarios.
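To make the fixed-dimensional nature of the ATM concrete, the following is a minimal sketch of the standard author-topic generative process as it is usually described in [18, 20]. The function name `generate_atm_corpus`, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and the array names `theta` and `phi` are illustrative assumptions and are not taken from Eq. (1).

```python
import numpy as np

def generate_atm_corpus(doc_authors, doc_lengths, n_topics, vocab_size,
                        alpha=0.5, beta=0.1, seed=0):
    """Sample a toy corpus from a finite author-topic model (illustrative).

    doc_authors: list of lists of author ids, one list per document.
    doc_lengths: number of words to draw for each document.
    n_topics must be fixed in advance, which is the limitation discussed above.
    """
    rng = np.random.default_rng(seed)
    n_authors = 1 + max(a for authors in doc_authors for a in authors)
    # Author interests: one n_topics-dimensional distribution per author.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_authors)
    # Topics: one vocab_size-dimensional word distribution per topic.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for authors, length in zip(doc_authors, doc_lengths):
        words = []
        for _ in range(length):
            a = rng.choice(authors)               # pick one author uniformly
            z = rng.choice(n_topics, p=theta[a])  # topic from that author's interests
            w = rng.choice(vocab_size, p=phi[z])  # word from the chosen topic
            words.append((int(a), int(z), int(w)))
        corpus.append(words)
    return theta, phi, corpus
```

Every array in this sketch has a shape that depends on `n_topics`, so changing the number of topics means redefining the model.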
3.2 Gamma Negative Binomial Process
3.2.1 Gamma Process
A gamma process [19] is a stochastic process parameterized by a base (shape) measure and a concentration (scale) parameter. It also corresponds to a completely random measure. A random realization of a gamma process is a set of points in a product space, and it can be written as
(2)  
where the atom locations enter through an indicator function and the weights follow an improper gamma distribution. Normalizing a realization of the gamma process yields the well-known Dirichlet process [22].
3.2.2 Negative Binomial Process
A negative binomial process [27] is also a stochastic process, parameterized by a base measure and a further parameter. Similarly to the gamma process, a realization of the negative binomial process is also a set of points in a product space. Then, we have
(3)  
where the weights are integers, so the negative binomial process is normally used for count modeling [3]. Compared with the Poisson process, which is also suitable for count modeling, the negative binomial process has a better variance-to-mean ratio (VMR) and overdispersion level [27].
3.2.3 Gamma-Negative Binomial Process
Normally, the negative binomial process is used as the likelihood part of a Bayesian model. Because a negative binomial distribution has two parameters, there are two kinds of priors for a negative binomial process: one is the gamma process [27], as shown in Eq. (3); the other is the beta process [3]. In this paper, we use the gamma process prior. A topic model based on the gamma-negative binomial process is proposed in [27], as shown in Fig. 2, and it can be represented as,
(4)
where the base measure of the negative binomial process is a random measure drawn from a gamma process. One negative binomial process is defined for each document, and this hierarchical form makes the documents share the same base measure. The gamma-negative binomial process can be equivalently augmented as a gamma-gamma-Poisson process,
(5)  
where the counts follow a Poisson process. This augmentation, which is useful for designing a closed-form inference algorithm, is equal in distribution to the gamma-negative binomial process model.
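The equivalence between a negative binomial count and a gamma-Poisson mixture, and the overdispersion that distinguishes it from a plain Poisson count, can be illustrated at the level of a single scalar count. The sketch below assumes the convention in which a negative binomial variable with parameters `r` and `p` has mean `r * p / (1 - p)`; the function names are hypothetical and the code is not part of the model in Eqs. (4) and (5).

```python
import numpy as np

def sample_nb_via_gamma_poisson(r, p, size, seed=0):
    """Draw negative binomial counts as a gamma-Poisson mixture.

    Assumed convention: mean = r * p / (1 - p).
    """
    rng = np.random.default_rng(seed)
    # Gamma-distributed Poisson rate with shape r and scale p / (1 - p).
    rate = rng.gamma(shape=r, scale=p / (1.0 - p), size=size)
    # Conditioned on the rate, the count is Poisson.
    return rng.poisson(rate)

def variance_to_mean_ratio(counts):
    """Empirical variance-to-mean ratio (VMR) of a sample of counts."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()
```

Under this convention the theoretical VMR is `1 / (1 - p)`, which is always larger than the value of one obtained for a Poisson count.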
Table 1: Frequently used notations.

Notation   Description
D          number of documents
A          number of authors
N          number of words
           author-document mapping matrix
           document-word mapping matrix
           number of authors of a document
In this paper, we will build an infinite author topic model based on this gamma-negative binomial process model.
4 Infinite Author Topic Model
In this section, we first propose our infinite author topic (IAT) model, and then introduce a Gibbs sampling strategy for inference in the proposed model.
4.1 Model Description
Consider again the gamma-negative binomial process topic model in Eqs. (4) and (5). Despite its success, this model is fundamentally the same as the basic topic models, which model data with a two-level hierarchy: document-keyword. Our aim is to extend the topic model to a three-level hierarchy: author-document-keyword. We therefore add another gamma process level to the gamma-negative binomial process topic model in Eq. (5) to capture the additional (author) level, analogous to the hierarchical form of the Hierarchical Dirichlet Process [23],
(6)  
where a newly added gamma process level represents the authors. We call this model the three-level gamma-negative binomial process topic model (3GNB); it is shown graphically in the left subfigure of Fig. 3. However, 3GNB has a limitation: it requires each document to have only one author.
In the 3GNB model, each document is assigned a realization of a gamma process,
(7) 
where each atom denotes a topic and carries the weight of that topic. This realization can be viewed as the interest of the document on the topics. The number of topics can potentially be infinite, which justifies the infinite summation. However, since the data are limited, the number of learned topics is also limited. Similarly to a document, each author is also assigned a realization of a gamma process,
(8) 
where the weights express the interests of the author on the topics. In the 3GNB model, the base measure of a document’s gamma process comes from its author. This can be seen as ‘interest inheritance’.
To model the setting where a document has multiple authors, we combine the gamma processes of all authors of a document by
(9) 
where the combination is convex over the authors of the document (each gamma process has the same weight in this paper), and the result is the mixed prior for the document’s gamma process. The mixed gamma process can be seen as the mixed interests of all the authors of a document. Then, the revised model is as follows,
and its graphical representation is shown in Fig. 3. Some frequently used notations are explained in Table 1.
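The equal-weight mixing of author interests can be sketched on a finite set of shared topics. This is an illustration under assumptions rather than the model itself: `author_weights` is a hypothetical array holding one row of non-negative topic weights per author, and the document weights are drawn topic by topic from gamma distributions whose shapes are the mixed weights, which is how a gamma process with a discrete base measure behaves on its atoms.

```python
import numpy as np

def mix_author_measures(author_weights, doc_authors):
    """Equal-weight convex combination of the topic weights of a document's authors."""
    author_weights = np.asarray(author_weights, dtype=float)
    return author_weights[list(doc_authors)].mean(axis=0)

def sample_document_weights(author_weights, doc_authors, scale=1.0, seed=0):
    """Draw document topic weights whose prior mass is the mixed author measure."""
    rng = np.random.default_rng(seed)
    mixed = mix_author_measures(author_weights, doc_authors)
    # One gamma draw per topic; the mixed author weight acts as the shape.
    return rng.gamma(shape=mixed, scale=scale)
```

For a single-author document the mixture reduces to that author's own weights, which matches the 3GNB special case described above.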
4.2 Model Inference
Posterior inference under infinite mixtures is difficult, so a common workaround in nonparametric Bayesian learning is truncation. This widely accepted method uses a relatively large number as the (potential) maximum number of topics. Under the truncation, the model can be expressed as follows, as a good approximation to the infinite model,
where is the total mass of measure , and the parameters are given the appropriate priors. Here, is a dimensional Dirichlet distribution, and each is a topic that is a dimensional vector.
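As a rough illustration of truncation, the sketch below uses a standard finite approximation of a gamma process: the total mass is split evenly across a fixed maximum number of atoms, and each atom receives an independent gamma weight. The names `total_mass`, `scale` and `max_topics` are assumptions made for this example, and the sketch is not necessarily the exact truncated model used in this paper.

```python
import numpy as np

def truncated_gamma_process(total_mass, scale, max_topics, seed=0):
    """Finite approximation of a gamma process with max_topics atoms.

    Each atom gets an independent Gamma(total_mass / max_topics, scale) weight,
    so the sum of the weights is Gamma(total_mass, scale) for any max_topics.
    """
    rng = np.random.default_rng(seed)
    return rng.gamma(shape=total_mass / max_topics, scale=scale, size=max_topics)
```

With a large `max_topics`, most weights are close to zero and only a few atoms carry noticeable mass, which is why a generous truncation level can still yield a small number of active topics.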
The difficult part of inference for this model is the mixed gamma process. Since the prior of a document is a mixture over its authors, it is hard to infer the posterior of each author’s gamma process through the likelihood. To resolve this issue, we first introduce the additive property of the negative binomial distribution,
Theorem 1
If $X_i$ follows a negative binomial distribution with parameters $r_i$ and $p$, and if the various $X_i$ are independent, then $\sum_i X_i$ follows a negative binomial distribution with parameters $\sum_i r_i$ and $p$.
In the model, we have
(10)  
(in distribution) equal to
(11) 
and, according to Theorem 1, it is further equal (in distribution) to
(12)  
where is the number of authors in document .
We have thus split the number of words assigned to a topic in a document into a set of independent variables, one per author: each denotes the number of words assigned to the topic from one author in the document. Eq. (12) provides the likelihood part for each author’s gamma process, so each can be updated (inferred) from its own auxiliary counts. Introducing these auxiliary variables resolves the inference difficulty caused by the mixed gamma process. Note that the independence between these variables is essential, because it allows each of them to be updated independently.
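The additive property of Theorem 1, on which this split relies, can be checked numerically. The sketch below compares the sum of independent negative binomial counts that share the same probability parameter with a single negative binomial count whose shape is the sum of the individual shapes. It assumes the convention with mean `r * p / (1 - p)`; because NumPy parameterizes the distribution by the complementary probability, `1 - p` is passed to the sampler. The function names are hypothetical.

```python
import numpy as np

def sum_of_independent_nb(shapes, p, n_samples, seed=0):
    """Sum of independent negative binomial draws with shapes r_i and common p."""
    rng = np.random.default_rng(seed)
    shapes = np.asarray(shapes, dtype=float)
    # One row of draws per shape; NumPy's second argument is 1 - p here.
    draws = rng.negative_binomial(shapes[:, None], 1.0 - p,
                                  size=(len(shapes), n_samples))
    return draws.sum(axis=0)

def single_nb(shape, p, n_samples, seed=1):
    """Draws from one negative binomial with the given total shape and p."""
    rng = np.random.default_rng(seed)
    return rng.negative_binomial(shape, 1.0 - p, size=n_samples)
```

Comparing the empirical mean and variance of `sum_of_independent_nb([0.5, 1.5, 2.0], 0.4, n)` with those of `single_nb(4.0, 0.4, n)` for a large `n` gives a quick sanity check of the theorem.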
According to the relationship between the negative binomial distribution and the gamma-Poisson distribution, for each of these variables we have:
(13)
We want to highlight the difference between two quantities: the first is the mixture of the gamma processes of the multiple authors, which serves as the prior of the gamma process of a document, and the second is the interest of a document on a topic inherited from a single author.
Because the gamma distribution is not conjugate to the negative binomial distribution, it is difficult to update a variable that has a gamma prior. To carry out inference with only closed-form conditional distributions, we use the following result on the negative binomial process,
Theorem 2
where CRT denotes the Chinese restaurant table distribution.
With Theorem 2, Eq. (13) is also equal to
(16)  
Finally, we can update all by,
(17)  
and each word in a document can be assigned to a topic and an author by
(18)  
With these changes of variables, the original model is reformulated as,
(19)  
In the following, a Gibbs sampling algorithm is designed for the posterior inference and all the conditional distributions are listed.
Sampling
(20) 
Sampling
(21) 
where is the number of words in document with author and topic .
Sampling
(22) 
Sampling
(23)  
Sampling
(24)  
Sampling
(25) 
Sampling
(26) 
where
(27) 
Sampling
(28) 
Sampling
(29) 
where
(30) 
Sampling
(31) 
All of these conditional distributions are in closed form, which makes them easy to update and implement. Note that the sampling of the CRT distribution can be found in [27]. The whole procedure is summarized in Algorithm 1.
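For readers who want a concrete picture of the CRT step, the following is a minimal sketch of the usual construction of a Chinese restaurant table draw as a sum of independent Bernoulli variables: with `m` customers and concentration `r`, the customer arriving after `n` others opens a new table with probability `r / (n + r)`. The function names `sample_crt` and `crt_mean` are hypothetical, and the sketch is not a substitute for the description in [27].

```python
import numpy as np

def sample_crt(m, r, rng):
    """Draw the number of occupied tables for m customers and concentration r."""
    if m == 0:
        return 0
    n = np.arange(m)        # 0, 1, ..., m - 1 customers already seated
    probs = r / (n + r)     # probability that the next customer opens a new table
    return int((rng.random(m) < probs).sum())

def crt_mean(m, r):
    """Expected number of tables, i.e. the sum of the Bernoulli probabilities."""
    n = np.arange(m)
    return float(np.sum(r / (n + r)))
```

The first customer always opens a table, so any draw with at least one customer is at least one and at most `m`.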
After obtaining all posterior samples of the latent variables and discarding the burn-in stage, we first identify the topic number that occurs most frequently among the samples as the learned topic number, and then select, from the samples with this topic number, the one with the largest likelihood. The output of the Gibbs sampler is the set of latent variables in this sample.
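The selection rule described in this paragraph can be sketched as follows. The representation of a sample as a dictionary with the keys `n_topics` and `log_likelihood` is an assumption made for illustration; in practice each sample would also carry its latent variables.

```python
from collections import Counter

def select_output_sample(samples, burn_in):
    """Pick the output sample from a list of Gibbs samples.

    samples: list of dicts with (hypothetical) keys 'n_topics' and 'log_likelihood'.
    burn_in: number of initial samples to discard.
    """
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after discarding the burn-in stage")
    # Step 1: the most frequent topic number among the retained samples.
    counts = Counter(s["n_topics"] for s in kept)
    best_k = counts.most_common(1)[0][0]
    # Step 2: among samples with that topic number, the largest likelihood.
    candidates = [s for s in kept if s["n_topics"] == best_k]
    return max(candidates, key=lambda s: s["log_likelihood"])
```

Restricting the second step to samples with the most frequent topic number keeps the reported topic number and the reported latent variables consistent with each other.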
5 Experiments
Table 2: Statistics of the two datasets.

Datasets   D        A        N
NIPS       1,740    2,037    13,649
DBLP       28,569   28,702   11,771

Table 3: Training and test groups selected from the DBLP dataset.

           D training   D test   A       N
group 1    1,072        319      1,115   3,783
group 2    1,071        316      1,094   3,782
group 3    1,075        305      1,071   3,788
group 4    1,076        339      1,104   3,823
group 5    1,079        310      1,111   3,841

Table 4: Training and test groups selected from the NIPS dataset.

           D training   D test   A       N
group 1    1,503        237      2,037   5,110
group 2    1,495        245      2,037   5,110
group 3    1,511        229      2,037   5,110
In this section, we evaluate the proposed infinite author topic model (IAT) and compare it with the finite author-topic model (ATM) on different datasets.
5.1 Datasets
The two public datasets used in this paper are:

1) NIPS papers (http://www.datalab.uci.edu/authortopic/NIPs.htm): This dataset contains papers from the NIPS conferences between 1987 and 1999. More description can be found in [20];

2) DBLP papers (http://www.cs.uiuc.edu/ hbdeng/data/kdd2011.htm): The abstracts and authors of papers are extracted through the DBLP interface from four areas: database, data mining, information retrieval and artificial intelligence. More description can be found in [6].
Some statistics of the two datasets are shown in Table 2. For each dataset, we randomly select some documents as training data and test data. Table 3 and Table 4 show the selection results on the two datasets. The numbers of selected training and test documents are given in the columns D training and D test of Tables 3 and 4. The selection requirement is that the training and test documents must share some authors and some words. This requirement ensures that the learned topics and authors’ interests can be used to predict the test documents.
5.2 Evaluation Metrics
In order to evaluate the performance of the proposed model, we calculate the perplexity of the test documents using the learned topics and author interests on these topics. Perplexity is widely used in language modeling to assess the predictive power of a model [20, 2]. It is a measure of how surprising the words in the test documents are from the model’s perspective. It can be computed as,
(32)  
where the conditioning is on the authors of each test document. The smaller the perplexity, the better the predictive ability of a model. Since the same test documents are used for all models, the normalization is omitted because it does not influence the model comparison.
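A minimal sketch of an author-conditioned perplexity computation is given below. It follows the usual author-topic formulation in which the probability of a word is obtained by averaging the topic distributions of the document's authors and mixing the topics accordingly. The arrays `theta` (author-by-topic) and `phi` (topic-by-word) are assumed to be normalized, the function names are hypothetical, and the sketch includes the per-word normalization for readability even though Eq. (32) omits it.

```python
import numpy as np

def heldout_log_likelihood(test_docs, test_authors, theta, phi):
    """Log-likelihood of test words given each document's authors.

    test_docs: list of word-id lists; test_authors: list of author-id lists.
    theta: (authors, topics) rows sum to one; phi: (topics, words) rows sum to one.
    """
    total = 0.0
    for words, authors in zip(test_docs, test_authors):
        # Equal-weight mixture of the authors' interests, pushed through the topics.
        word_probs = theta[list(authors)].mean(axis=0) @ phi
        total += float(np.log(word_probs[list(words)]).sum())
    return total

def perplexity(test_docs, test_authors, theta, phi):
    """Exponential of the negative average log-likelihood per test word."""
    n_words = sum(len(words) for words in test_docs)
    log_lik = heldout_log_likelihood(test_docs, test_authors, theta, phi)
    return float(np.exp(-log_lik / n_words))
```

As a sanity check, if every topic is uniform over a vocabulary of four words, the perplexity is exactly four regardless of the author interests.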
Another evaluation metric is the training data likelihood,
(33) 
This measures the probability of the training documents under the learned latent variables. It can be understood as ‘how well the model fits the training data’. The larger the likelihood, the better the model fits the training data.
5.3 Results Analysis
For the DBLP dataset, all results are shown in Fig. 4. Each row of Fig. 4 corresponds to one group of the DBLP dataset in Table 3. The left subfigures compare the data log-likelihood. Here, we adjust different active topic numbers for the ATM, including , , , and . In these subfigures, the proposed IAT model (the hyperparameters are set empirically as follows for the rest of this section: , , , , and ) outperforms the ATM for the different preset topic numbers. This means that IAT fits the training documents better than the ATM and, more importantly, that IAT does not depend on domain knowledge to predefine the active topic number, which makes the method widely applicable.
The middle subfigures in Fig. 4 show how the number of active topics changes during the iterations of IAT (the number of active topics is initialized to the number of training documents). These curves show that the number of active topics drops dramatically during the burn-in stage of sampling and begins to stabilize after about 200 iterations. Since the documents differ in content but are similar in number across the groups, the learned topic numbers differ slightly from one another. These numbers are: group 1: ; group 2: ; group 3: ; group 4: ; group 5: .
To show the effectiveness of the proposed model, we also compare the two models (IAT and ATM) on test document prediction using the perplexity in Eq. (32). Since the training and test documents share some authors, we can compute the perplexity of the test documents from the learned topics and the authors’ interests on them. At each iteration, the perplexity of the test documents is computed using the latent variables at this iteration. The results are shown in the right subfigures of Fig. 4. In each subfigure, the first bar denotes the mean perplexity of the proposed IAT model over all iterations except the burn-in stage ( iterations), and the other bars denote ATM with different (predefined) topic numbers. The standard deviations are also shown in the subfigures. The proposed model achieves the best performance (smallest perplexity). The standard deviation of IAT is somewhat larger than that of ATM. The reason is that the number of active topics changes during the iterations of IAT but not of ATM, so, in theory, the random-walk space of the Gibbs sampler of IAT can be larger than that of ATM. Even with this larger standard deviation, the mean perplexity of IAT is smaller than that of ATM.
For the NIPS dataset, all results are shown in Fig. 5. As with the DBLP dataset, the log-likelihoods of IAT and of ATM with different predefined active topic numbers are shown on the left side of Fig. 5. Unsurprisingly, the subfigures in the middle column show the convergence of IAT (group 1: 367; group 2: 529; group 3: 354). In particular, we found that the log-likelihood of ATM increases as the topic number decreases. Therefore, we also compare with ATM with only two topics (the minimum number), as shown in the left subfigures of Fig. 5. The proposed IAT model again obtains a larger log-likelihood and a smaller perplexity than ATM, except where ATM is set to have 10 topics in group 2. Even so, the ATM with 10 topics in group 2 performs almost the same as IAT on the log-likelihood of the training documents. Moreover, the ATM with 10 topics takes 800 iterations to reach this stability, whereas IAT takes fewer than 50 iterations to reach the same stability.
6 Conclusions and Further Study
We have developed an infinite author topic model that can automatically learn the latent features of the author-document-keyword hierarchy, which include the hidden topics, the authors’ interests on these topics and the number of topics, from text corpora. Stochastic processes are adopted instead of fixed-dimensional probability distributions. The model uses a mixed author gamma process as the base measure of the document gamma process to capture the author-document mapping. We have demonstrated on real-world datasets that the designed Gibbs sampling algorithm can learn such an infinite author topic model.
Other potential applications of this work include multi-label learning [25]: the ‘authors’ in the proposed model can be seen as labels, and inference in the model can be seen as training a multi-label classifier. The learned topics can be seen as an infinite feature space. This is left for further study.
7 Acknowledgments
Research work reported in this paper was partly supported by the Australian Research Council (ARC) under discovery grant DP140101366 and the China Scholarship Council. This work was jointly supported by the National Science Foundation of China under grant no. 61471232.
References
 [1] S. Bao, S. Xu, L. Zhang, R. Yan, Z. Su, D. Han, and Y. Yu. Mining social emotions from affective text. IEEE Transactions on Knowledge and Data Engineering, 24(9):1658–1670, Sept. 2012.
 [2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
 [3] T. Broderick, L. Mackey, J. Paisley, and M. Jordan. Combinatorial clustering and the beta negative binomial process. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2014.
 [4] W. L. Buntine and S. Mishra. Experiments with nonparametric topic models. KDD ’14, pages 881–890, New York, NY, USA, 2014. ACM.
 [5] X. Chen, X. Hu, T. Y. Lim, X. Shen, E. K. Park, and G. L. Rosen. Exploiting the functional and taxonomic structure of genomic data by probabilistic topic modeling. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(4):980–991, July 2012.
 [6] H. Deng, J. Han, B. Zhao, Y. Yu, and C. X. Lin. Probabilistic topic models with biased propagation on heterogeneous information networks. KDD ’11, pages 1271–1279, New York, NY, USA, 2011. ACM.
 [7] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A):1020–1056, 2011.
 [8] T. Iwata, A. Shah, and Z. Ghahramani. Discovering latent influence in online social activities via shared cascade poisson processes. KDD ’13, pages 266–274, New York, NY, USA, 2013. ACM.
 [9] H. Kim, Y. Sun, J. Hockenmaier, and J. Han. Etm: Entity topic models for mining documents associated with entities. ICDM ’12, pages 349–358, Washington, DC, USA, 2012. IEEE Computer Society.
 [10] L. Liu, L. Sun, Y. Rui, Y. Shi, and S. Yang. Web video topic discovery and tracking via bipartite graph reinforcement model. WWW ’08, pages 1009–1018, New York, NY, USA, 2008. ACM.
 [11] R. M. Neal. Markov chain sampling methods for dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
 [12] C.-T. Nguyen, N. Kaothanthong, T. Tokuyama, and X.-H. Phan. A feature-word-topic model for image annotation and retrieval. ACM Transactions on Web, 7(3):12:1–12:24, Sept. 2013.
 [13] A. Pinto and G. Haus. A novel xml music information retrieval method using graph invariants. ACM Transactions on Information Systems, 25(4), Oct. 2007.
 [14] M. H. Quenouille. A relation between the logarithmic, poisson, and negative binomial series. Biometrics, 5(2):162–164, 1949.
 [15] D. Ramage, C. D. Manning, and S. Dumais. Partially labeled topic models for interpretable text mining. KDD ’11, pages 457–465, New York, NY, USA, 2011. ACM.
 [16] C. E. Rasmussen. The infinite gaussian mixture model. NIPS ’12, pages 554–560, 1999.
 [17] M. Rosen-Zvi, C. Chemudugunta, T. Griffiths, P. Smyth, and M. Steyvers. Learning author-topic models from text corpora. ACM Transactions on Information Systems, 28(1):4:1–4:38, Jan. 2010.
 [18] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. UAI ’04, pages 487–494, Arlington, Virginia, United States, 2004. AUAI Press.
 [19] A. Roychowdhury and B. Kulis. Gamma processes, stick-breaking, and variational inference. arXiv preprint arXiv:1410.1068, 2014.
 [20] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. KDD ’04, pages 306–315, New York, NY, USA, 2004. ACM.
 [21] J. Tang, R. Jin, and J. Zhang. A topic modeling approach and its integration into the random walk framework for academic search. ICDM ’08, pages 1055–1060, Dec 2008.
 [22] Y. W. Teh. Dirichlet process. In Encyclopedia of machine learning, pages 280–287. Springer, 2010.
 [23] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
 [24] X. Wang and A. McCallum. Topics over time: A non-Markov continuous-time model of topical trends. KDD ’06, pages 424–433, New York, NY, USA, 2006. ACM.
 [25] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, Aug 2014.
 [26] J. Zhao, P. Wang, and K. Huang. A semi-supervised approach for author disambiguation in kdd cup 2013. KDD Cup ’13, pages 10:1–10:8, New York, NY, USA, 2013. ACM.
 [27] M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2013.
 [28] J. Zhu, A. Ahmed, and E. P. Xing. Medlda: Maximum margin supervised topic models. Journal of Machine Learning Research, 13(1):2237–2278, Aug. 2012.