1 Introduction
Probabilistic models that infer the interests of authors have attracted much attention throughout the language modeling community, with the Author-Topic model [4] as one of their seminal representatives. Multiple modifications to the Author-Topic model have been proposed. These modifications either assume a fixed number of topics or use authorship information as an additional feature in a nonparametric setting that bears little resemblance to the structure of the original work. This article addresses a complementary problem: representing the Author-Topic model in the framework of Bayesian nonparametrics while keeping as much as possible of its original structure. While this might be valuable in its own right, it is also useful in a more general sense, since the steps necessary to transform an extension of Latent Dirichlet Allocation (LDA) with a fixed number of parameters into an equivalent model that grows the number of parameters with the amount of data available apply to a broad range of models.
2 Generative models for documents and authors
We describe two models: the first relates authors and documents via a fixed number of topics, and the second models the interests of authors using a flexible number of topics. Both models use the common notation of a document d being a vector of $N_d$ words, $\mathbf{w}_d$, where all $w_{di}$ are chosen from a vocabulary with V terms, and $\mathbf{a}_d$ are the authors of document d, chosen from the set of all authors of size J. A corpus of D documents is then defined by the set $\{(\mathbf{w}_1, \mathbf{a}_1), \dots, (\mathbf{w}_D, \mathbf{a}_D)\}$.



[Figure 1: Graphical models of a) the parametric and b) the nonparametric Author-Topic model.]
2.1 The parametric model
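The seminal Author-Topic model [4] has two sets of unknown parameters, the J distributions over topics conditioned on the authors, $\theta_j$, and the K distributions over terms conditioned on the topics, $\phi_k$, as well as the assignments of individual words to authors, $x_{di}$, and topics, $z_{di}$. With $\theta$ and $\phi$ integrated out, a collapsed Gibbs sampler is used, analogous to [1], whose Markov chain over the state variables x and z converges to their posterior distribution. The transitions between the states of the chain result from iteratively sampling each pair $(x_{di}, z_{di})$ as a block, conditioned on all other variables:

$$P(x_{di}=j, z_{di}=k \mid w_{di}=t, \mathbf{x}_{\neg di}, \mathbf{z}_{\neg di}, \mathbf{w}_{\neg di}, \mathbf{a}_d) \propto \frac{n_{jk}^{\neg di} + \alpha}{n_{j\cdot}^{\neg di} + K\alpha} \, f_k^{\neg di}(w_{di}=t), \quad j \in \mathbf{a}_d \tag{1}$$

where $n_{jk}^{\neg di}$ is the number of times a word of topic k has been assigned to author j, excluding the current instance, and $\cdot$ is used in place of an index to indicate that the sum over its values (e.g. $n_{j\cdot} = \sum_k n_{jk}$) is taken. Here $\alpha$ and $\beta$ denote the symmetric Dirichlet hyperparameters of $\theta$ and $\phi$. The assignment variable $z_{di}=k$ represents the topic of the i-th word in document d being k, while $x_{di}=j$ represents its assignment to author j. The term on the right side is the posterior density of term t under topic k:

$$f_k^{\neg di}(w_{di}=t) = \frac{n_{kt}^{\neg di} + \beta}{n_{k\cdot}^{\neg di} + V\beta} \tag{2}$$

where $n_{kt}^{\neg di}$ is the number of times term t has been assigned to topic k, again excluding the current word from the count.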
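The blocked update of the parametric model can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [4]: the function names `block_conditional` and `sample_pair`, the nested-list layout of the count tables, and the symmetric hyperparameters `alpha` and `beta` are assumptions made for this example, and the counts are expected to exclude the word currently being resampled.

```python
import random


def block_conditional(t, authors_d, n_jk, n_kt, alpha, beta):
    """Normalised probability of every (author, topic) pair for one word.

    t          index of the observed term
    authors_d  list of author indices of the current document
    n_jk[j][k] words of author j assigned to topic k (current word excluded)
    n_kt[k][t] times term t is assigned to topic k (current word excluded)
    alpha/beta symmetric Dirichlet hyperparameters (assumed notation)
    """
    K = len(n_kt)
    V = len(n_kt[0])
    weights = {}
    for j in authors_d:
        n_j = sum(n_jk[j])
        for k in range(K):
            # Probability of topic k under author j.
            author_term = (n_jk[j][k] + alpha) / (n_j + K * alpha)
            # Posterior density of term t under topic k.
            word_term = (n_kt[k][t] + beta) / (sum(n_kt[k]) + V * beta)
            weights[(j, k)] = author_term * word_term
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def sample_pair(probs, rng):
    """Draw one (author, topic) pair from the normalised block conditional."""
    pairs = list(probs)
    return rng.choices(pairs, weights=[probs[p] for p in pairs])[0]
```

In a full sampler, the counts for the current word would be decremented before calling `block_conditional` and incremented again for the pair returned by `sample_pair`.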
2.2 The nonparametric model
One frequently raised question when applying the Author-Topic model to a new data set is how to choose the number of topics [5]. The Bayesian nonparametric framework of the Hierarchical Dirichlet Process (HDP) [6] offers an elegant solution by placing a prior over a countably infinite number of topics, of which only a few will dominate the posterior. Building on the finite version of the model, we split the symmetric prior over topics into a scalar precision parameter $\alpha$ and a distribution $\tau \sim \mathrm{Dir}(\gamma/K)$. Taking the limit $K \to \infty$, we get the root distribution for the nonparametric Author-Topic model (Fig. 1b). Analogously to the collapsed Gibbs sampler for the previous LDA version, we integrate over $\theta$ and $\phi$ but keep $\tau$ as an auxiliary variable to preserve the structure of the state transition probabilities of the finite case for the HDP [2]:

$$P(x_{di}=j, z_{di}=k \mid w_{di}=t, \cdot) \propto \begin{cases} \dfrac{n_{jk}^{\neg di} + \alpha\tau_k}{n_{j\cdot}^{\neg di} + \alpha} \, f_k^{\neg di}(w_{di}=t) & \text{if } k \le K \\[2ex] \dfrac{\alpha\tau_{K+1}}{n_{j\cdot}^{\neg di} + \alpha} \, f_{k^{\mathrm{new}}}(w_{di}=t) & \text{if } k = K+1 \end{cases} \tag{3}$$

with the counts $n_{jk}^{\neg di}$ and the density $f_k^{\neg di}$ as in (1) and (2), and with $f_{k^{\mathrm{new}}}(w_{di}=t) = 1/V$ being the prior density of a word under a new topic [6]. The key difference between these equations and the original model (1) is that we now have a root distribution $\tau$ for the HDP over K+1 possible states. If there are K topics in the current step, then $\tau_{K+1}$ represents the accumulated probability mass of all possible but currently unused topics, allowing the sampler to choose a new one from a countably infinite pool of empty topics. If the number of words assigned to a topic drops to zero, the topic is returned to the pool of unused topics.
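As an illustration of how the finite update changes once a root distribution over K+1 states is introduced, the following sketch computes the block conditional including the extra state for a new topic. The function name `hdp_block_conditional`, the list `tau` holding the K weights of the used topics followed by the mass of all unused topics, and the parameter names `alpha` and `beta` are assumptions made for this example; the counts are again expected to exclude the current word.

```python
def hdp_block_conditional(t, authors_d, n_jk, n_kt, tau, alpha, beta):
    """Normalised probability of every (author, topic) pair, where the
    topic index K stands for a new, currently unused topic.

    tau  root distribution: tau[0..K-1] for the used topics and tau[K]
         for the accumulated mass of all unused topics (assumed layout)
    """
    K = len(n_kt)
    V = len(n_kt[0])
    if len(tau) != K + 1:
        raise ValueError("tau needs one entry per used topic plus one")
    weights = {}
    for j in authors_d:
        n_j = sum(n_jk[j])
        for k in range(K):
            # Existing topic: the author count is smoothed by alpha * tau[k].
            author_term = (n_jk[j][k] + alpha * tau[k]) / (n_j + alpha)
            word_term = (n_kt[k][t] + beta) / (sum(n_kt[k]) + V * beta)
            weights[(j, k)] = author_term * word_term
        # New topic: only the prior mass and the prior word density 1 / V.
        weights[(j, K)] = alpha * tau[K] / (n_j + alpha) * (1.0 / V)
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}
```

When a pair with topic index K is drawn, a sampler built on this sketch would append a new row to both count tables and extend `tau` before the next step. That bookkeeping is omitted here.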
2.3 Sampling the root distribution
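However, the construction of a Markov chain for the nonparametric Author-Topic model additionally requires sampling the root distribution $\tau$ of the Dirichlet processes, which was not present in the finite version of the model. The discrete part of the root distribution guarantees that existing topics are reused with probability $\tau_k$, and the continuous part allows for a new topic to be sampled with probability $\tau_{K+1}$ [3]. Given the Markov state, we begin by generating J vectors $\mathbf{u}_j$ with entries

$$u_{jkr} = \frac{\alpha\tau_k}{\alpha\tau_k + r - 1}, \quad k = 1, \dots, K, \quad r = 1, \dots, n_{jk} \tag{4}$$

where $n_{jk}$ is the number of words of author j that have been assigned to topic k and $\alpha$ is the precision parameter of the author-level Dirichlet processes. Next, we draw Bernoulli random variables $m_{jkr} \sim \mathrm{Bernoulli}(u_{jkr})$. The posterior of the top-level Dirichlet process with precision $\gamma$ is then sampled via

$$\tau \sim \mathrm{Dir}(m_{\cdot 1}, \dots, m_{\cdot K}, \gamma), \quad m_{\cdot k} = \sum_{j} \sum_{r} m_{jkr} \tag{5}$$

making $\tau$ a discrete distribution over the K used topics plus one component that carries the probability mass of the infinitely many possible, yet unused topics.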
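The two steps above, Bernoulli draws followed by a Dirichlet draw, can be sketched with the Python standard library alone. The function names `sample_table_counts` and `sample_root`, the argument `gamma` for the precision of the top-level process, and the use of normalised gamma variates to obtain the Dirichlet sample are choices made for this example. The sketch assumes that every used topic has positive root weight and at least one assigned word.

```python
import random


def sample_table_counts(n_jk, tau, alpha, rng):
    """Sum Bernoulli draws per topic over all authors and assigned words.

    For the r-th word (r = 0, 1, ...) of author j in topic k, the success
    probability is alpha * tau[k] / (alpha * tau[k] + r), so the first
    word of an author in a topic always contributes one success.
    """
    K = len(n_jk[0])
    m = [0] * K
    for counts_j in n_jk:
        for k in range(K):
            for r in range(counts_j[k]):
                p = alpha * tau[k] / (alpha * tau[k] + r)
                if rng.random() < p:
                    m[k] += 1
    return m


def sample_root(m, gamma, rng):
    """Draw the root distribution from Dirichlet(m_1, ..., m_K, gamma).

    A Dirichlet sample is obtained by normalising independent gamma
    variates; every m_k must be positive, which holds for used topics.
    """
    if any(count <= 0 for count in m):
        raise ValueError("every used topic needs a positive count")
    draws = [rng.gammavariate(shape, 1.0) for shape in list(m) + [gamma]]
    total = sum(draws)
    return [g / total for g in draws]
```

The last entry of the list returned by `sample_root` is the mass reserved for unused topics. The remaining entries are the weights of the K used topics in the next Gibbs sweep.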
3 Discussion
In this work, we transformed the LDA-based Author-Topic model into a nonparametric model that estimates the number of components necessary for representing the data. It remains necessary to evaluate the performance (e.g. perplexity) of the proposed model empirically on benchmark data sets. While we chose the Author-Topic model as an example for such a transformation, we believe that many of the considerations made here hold equally for a wider range of models and can serve as a blueprint for a simple application of nonparametric Bayesian priors.
References
[1] Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(1), 5228–5235 (2004)
[2] Heinrich, G.: “Infinite LDA” – Implementing the HDP with minimum code complexity. Technical report, arbylon.net (2011)
[3] Porteous, I.: Mixture Block Methods for Non Parametric Bayesian Models with Applications. Ph.D. thesis (2010)
[4] Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The Author-Topic Model for Authors and Documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (2004)
[5] Sugimoto, C., Li, D., Russell, T., Finlay, S., Ding, Y.: The shifting sands of disciplinary development: Analyzing North American Library and Information Science dissertations using latent Dirichlet allocation. Journal of the American Society for Information Science and Technology 62(1), 185–204 (2011)
[6] Teh, Y., Jordan, M., Beal, M., Blei, D.: Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101(476), 1566–1581 (2006)