A simple non-parametric Topic Mixture for Authors and Documents

11/27/2012
by Arnim Bleier, et al.

This article reviews the Author-Topic Model and presents a new non-parametric extension based on the Hierarchical Dirichlet Process. The extension is especially suitable when no prior information about the number of components necessary is available. A blocked Gibbs sampler is described, and the focus is put on staying as close as possible to the original model, with only the minimum of theoretical and implementation overhead necessary.


1 Introduction

Probabilistic models to infer the interests of authors have attracted much interest throughout the language modeling community, with the Author-Topic model [4] as one of its seminal representatives. Multiple modifications of the Author-Topic model have been proposed. These modifications either assume a fixed number of topics or focus on using authorship information as an additional feature in a non-parametric setting, with only little resemblance to the structure of the original work. This article addresses a complementary problem: representing the Author-Topic model in the framework of Bayesian non-parametrics while keeping as much as possible of its original structure. While this might be valuable in its own right, it is also useful in a more general sense, since the steps necessary to transform an extension of Latent Dirichlet Allocation (LDA) with a fixed number of parameters into an equivalent model that grows the number of parameters with the amount of data available apply to a broad range of models.

2 Generative models for documents and authors

We will describe two different models: the first relates authors and documents via a fixed number of topics, and the second models the interests of authors using a flexible number of topics. Both models are described using the common notation of a document $d$ being a vector of $N_d$ words, $\mathbf{w}_d = (w_{d1}, \ldots, w_{dN_d})$, where all $w_{di}$ are chosen from a vocabulary of $V$ terms, and $\mathbf{a}_d$ are the authors of document $d$, chosen from the set of all authors of size $J$. A corpus of $D$ documents is then defined by the set $\{(\mathbf{w}_d, \mathbf{a}_d)\}_{d=1}^{D}$.
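To make this notation concrete, here is a minimal sketch of one possible corpus representation in Python; the identifiers and the toy data are illustrative only and not part of the paper:

```python
# A corpus is a list of D documents; each document d is a pair (w_d, a_d):
# w_d is the vector of word indices drawn from a vocabulary of V terms,
# a_d is the list of author indices drawn from the set of J authors.
corpus = [
    ([0, 3, 3, 7], [0, 2]),   # document 1: four words, written by authors 0 and 2
    ([5, 1, 0],    [1]),      # document 2: three words, a single author
]
V = 8            # vocabulary size (assumed)
J = 3            # number of authors (assumed)
D = len(corpus)  # number of documents
```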



Figure 1: Admixture models for documents and authors: (a) the Author-Topic model, (b) the non-parametric Author-Topic model (this paper).

2.1 The parametric model

The seminal Author-Topic model [4] has two sets of unknown parameters: $J$ distributions over topics $\vartheta_j$ conditioned on the authors, and $K$ distributions over terms $\varphi_k$ conditioned on the topics, as well as the assignments of individual words to authors $x_{di}$ and topics $z_{di}$. With $\vartheta$ and $\varphi$ integrated out, a collapsed Gibbs sampler is used, analogous to [1], to converge to the true underlying distributions of the Markov state variables $\mathbf{x}$ and $\mathbf{z}$. The transitions between the states of the chain result from iteratively sampling each pair $(x_{di}, z_{di})$ as a block, conditioned on all other variables:

$$p(x_{di}=j,\, z_{di}=k \mid w_{di}=t, \mathbf{x}^{-di}, \mathbf{z}^{-di}, \mathbf{w}^{-di}, \mathbf{a}_d) \;\propto\; \frac{n_{jk}^{-di} + \alpha}{n_{j\cdot}^{-di} + K\alpha}\; f_k^{-di}(t) \qquad (1)$$

where $n_{jk}^{-di}$ is the number of times a word of topic $k$ has been assigned to author $j$, excluding the current instance, and $\cdot$ is used in place of a variable to indicate that the sum over its values (e.g. $n_{j\cdot} = \sum_k n_{jk}$) is taken. The assignment variable $z_{di}=k$ represents the topic of the $i$-th word in document $d$ being $k$, just as $x_{di}=j$ represents the assignment to author $j$. The term on the right-hand side is the posterior density of term $t$ under topic $k$:

$$f_k^{-di}(t) = \frac{n_{kt}^{-di} + \beta}{n_{k\cdot}^{-di} + V\beta} \qquad (2)$$

where $n_{kt}^{-di}$ is the number of times term $t$ has been assigned to topic $k$, again excluding the current word from the count.
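A minimal sketch of one such blocked update in Python may help fix ideas. The count arrays `n_jk` (authors by topics) and `n_kt` (topics by terms), assumed to be maintained elsewhere with the current word already removed, are illustrative names and not the authors' implementation:

```python
import numpy as np

def sample_author_topic(t, authors_d, n_jk, n_kt, alpha, beta):
    """Draw one (author, topic) pair for a word with term index t in a document
    whose author set is authors_d, following Eqs. (1)-(2).
    n_jk and n_kt must already exclude the word being resampled."""
    K = n_jk.shape[1]
    V = n_kt.shape[1]
    # Posterior density of term t under each topic k, Eq. (2).
    f_k = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta)
    # Unnormalised block probabilities for every (j, k) with j among the document's authors, Eq. (1).
    p = np.zeros((len(authors_d), K))
    for a, j in enumerate(authors_d):
        p[a] = (n_jk[j] + alpha) / (n_jk[j].sum() + K * alpha) * f_k
    p /= p.sum()
    # Sample the pair (x_di, z_di) as one block.
    idx = np.random.choice(len(authors_d) * K, p=p.ravel())
    return authors_d[idx // K], idx % K
```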

2.2 The non-parametric model

One frequently raised question when applying the Author-Topic model to a new data set is how to choose the number of topics [5]. The Bayesian non-parametric framework of the Hierarchical Dirichlet Process (HDP) [6] offers an elegant solution by allowing a prior over a countably infinite number of topics, of which only a few will dominate the posterior. Building on the finite version of the model, we split the symmetric prior $\alpha$ over topics into a scalar precision parameter $\alpha$ and a distribution $\tau \sim \mathrm{Dir}(\gamma/K, \ldots, \gamma/K)$. Taking this to the limit $K \to \infty$, we get the root distribution $\tau$ for the non-parametric Author-Topic model (fig. 1b). Analogously to the collapsed Gibbs sampler for the previous LDA version, we integrate over $\vartheta$ and $\varphi$, but keep $\tau$ as an auxiliary variable to preserve the structure of the state transition probabilities of the finite case for the HDP [2]:

$$p(x_{di}=j,\, z_{di}=k \mid w_{di}=t, \mathbf{x}^{-di}, \mathbf{z}^{-di}, \mathbf{w}^{-di}, \mathbf{a}_d, \tau) \;\propto\; \frac{n_{jk}^{-di} + \alpha\tau_k}{n_{j\cdot}^{-di} + \alpha}\; f_k^{-di}(t), \qquad k = 1, \ldots, K+1 \qquad (3)$$

with $f_{K+1}^{-di}(t) = \beta/(V\beta) = 1/V$ being the prior density of a word under a new topic [6] and $n_{j,K+1}^{-di} = 0$. The key difference between these equations and the original model (1) is that we now have a root distribution $\tau = (\tau_1, \ldots, \tau_K, \tau_{K+1})$ for the HDP over $K+1$ possible states. If there are $K$ topics in the current step, then $\tau_{K+1}$ represents the accumulated continuous probability mass of all possible but currently unused topics, allowing a new one to be chosen from a countably infinite pool of empty topics. If the count $n_{\cdot k}$ of words assigned to a topic drops to zero, the topic is returned to the pool of unused topics.
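As a sketch of Eq. (3), the only change to the blocked update shown above is that the topic probabilities gain a $(K+1)$-th column for a new topic; again, the array names (`n_jk`, `n_kt`, `tau`) are assumptions for illustration:

```python
import numpy as np

def sample_author_topic_hdp(t, authors_d, n_jk, n_kt, tau, alpha, beta):
    """Blocked draw of (author, topic) under Eq. (3).
    n_jk is (J x K), n_kt is (K x V); tau has K+1 entries, the last one
    being the mass reserved for a not yet used topic."""
    K, V = n_kt.shape
    # Posterior density of term t under the K used topics, plus 1/V for a new topic.
    f_k = np.append((n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta), 1.0 / V)
    p = np.zeros((len(authors_d), K + 1))
    for a, j in enumerate(authors_d):
        counts = np.append(n_jk[j], 0.0)  # a new topic has no counts yet
        p[a] = (counts + alpha * tau) / (n_jk[j].sum() + alpha) * f_k
    p /= p.sum()
    idx = np.random.choice(len(authors_d) * (K + 1), p=p.ravel())
    j, k = authors_d[idx // (K + 1)], idx % (K + 1)
    return j, k  # k == K signals that a new topic is opened
```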

2.3 Sampling the Root Distribution

The construction of a Markov chain for the non-parametric Author-Topic model additionally requires that the root distribution $\tau$ of the Dirichlet processes be sampled, a step that was not present in the finite version of the model. The discrete part of the root distribution guarantees that existing topics are reused with probability $\sum_{k=1}^{K} \tau_k$, and the continuous part allows a new topic to be sampled with probability $\tau_{K+1}$ [3]. Given the Markov state $\{\mathbf{x}, \mathbf{z}\}$, we begin by generating $J$ vectors

$$\mathbf{m}_j = (m_{j1}, \ldots, m_{jK}), \qquad m_{jk} = \sum_{i=1}^{n_{jk}} b_{jki}, \qquad (4)$$

where $n_{jk}$ is the number of words for author $j$ which have been assigned to topic $k$. Next, we draw the Bernoulli random variables $b_{jki} \sim \mathrm{Bernoulli}\big(\alpha\tau_k / (\alpha\tau_k + i - 1)\big)$. The posterior of the top-level Dirichlet process is then sampled via

$$(\tau_1, \ldots, \tau_K, \tau_{K+1}) \sim \mathrm{Dir}(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma), \qquad (5)$$

making $\tau$ a discrete distribution over the $K$ used topics plus one component carrying the probability mass of the infinitely many possible, yet unused, topics.
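The two steps of Eqs. (4)-(5) translate into a short routine. The sketch below reuses the hypothetical count array `n_jk` from the earlier snippets and is meant only to illustrate the scheme:

```python
import numpy as np

def sample_root_distribution(n_jk, tau, alpha, gamma):
    """Resample the root distribution tau given the counts n_jk (J x K),
    following Eqs. (4)-(5): Bernoulli draws give the per-author vectors m_j,
    whose column sums parameterise the Dirichlet posterior of tau."""
    J, K = n_jk.shape
    m = np.zeros((J, K))
    for j in range(J):
        for k in range(K):
            for i in range(1, int(n_jk[j, k]) + 1):
                # b_{jki} ~ Bernoulli(alpha * tau_k / (alpha * tau_k + i - 1)); Eq. (4)
                m[j, k] += np.random.rand() < alpha * tau[k] / (alpha * tau[k] + i - 1)
    # Eq. (5): Dirichlet over the K used topics plus the mass of unused topics.
    return np.random.dirichlet(np.append(m.sum(axis=0), gamma))
```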

3 Discussion

In this work, we transformed the LDA-based Author-Topic model into a non-parametric model that estimates the number of components necessary for representing the data. It will still be necessary to empirically evaluate the performance (i.e. perplexity) of the proposed model on benchmark data sets. While we chose the Author-Topic model as an example for such a transformation, we believe that many of the considerations made hold equally for a wider range of models and can serve as a blueprint for a simple application of non-parametric Bayesian priors.

References

  • [1] Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(Suppl. 1), 5228–5235 (2004)
  • [2] Heinrich, G.: "Infinite LDA" – Implementing the HDP with minimum code complexity. Technical report, arbylon.net (2011)
  • [3] Porteous, I.: Mixture Block Methods for Non Parametric Bayesian Models with Applications. Ph.D. thesis (2010)
  • [4] Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The Author-Topic Model for Authors and Documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (2004)
  • [5] Sugimoto, C., Li, D., Russell, T., Finlay, S., Ding, Y.: The shifting sands of disciplinary development: Analyzing North American Library and Information Science dissertations using latent Dirichlet allocation. Journal of the American Society for Information Science and Technology 62(1), 185–204 (2011)
  • [6] Teh, Y., Jordan, M., Beal, M., Blei, D.: Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101(476), 1566–1581 (2006)