Analyses of Multi-collection Corpora via Compound Topic Modeling

06/17/2019
by Clint P. George, et al.

As electronically stored data grow in daily life, obtaining novel and relevant information from text becomes increasingly challenging. Researchers have therefore sought statistical methods for text mining based on term frequency, matrix algebra, or topic modeling. Popular topic models center on a single text collection, which limits their use for comparative text analyses. We consider a setting where the corpus can be partitioned into subcollections. All subcollections share a common set of topics, but topic proportions vary across collections. To incorporate prior knowledge about the corpus (e.g., its organizational structure), we propose the compound latent Dirichlet allocation (cLDA) model, which improves on previous work, encourages generalizability, and depends less on user-input parameters. To identify the parameters of interest in cLDA, we study Markov chain Monte Carlo (MCMC) and variational inference approaches extensively and suggest an efficient MCMC method. We evaluate cLDA qualitatively and quantitatively using both synthetic and real-world corpora. A usability study on real-world corpora illustrates the superiority of cLDA, which not only explores the underlying topics automatically but also models their connections and variations across multiple collections.
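The setting described above, with topics shared across subcollections and topic proportions that differ between them, can be made concrete with a small simulation. The sketch below is illustrative only: the function name `simulate_corpus`, the hyperparameters `eta`, `alpha`, and `gamma`, and the choice to draw each document's topic proportions from a Dirichlet centered on its collection's proportions are all assumptions made for this example, not details stated in the abstract.

```python
import numpy as np

def simulate_corpus(n_collections=2, docs_per_collection=3, doc_length=50,
                    n_topics=4, vocab_size=30,
                    eta=0.5, alpha=2.0, gamma=20.0, seed=0):
    """Simulate a corpus split into collections that share one set of topics.

    All names and hyperparameters here are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)

    # One set of topics (distributions over the vocabulary), shared by every collection.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

    collection_props = []
    corpus = []
    for j in range(n_collections):
        # Collection-level topic proportions: this is where collections differ.
        pi_j = rng.dirichlet(np.full(n_topics, alpha))
        collection_props.append(pi_j)

        for _ in range(docs_per_collection):
            # Assumed link: document proportions scatter around pi_j;
            # a larger gamma keeps documents closer to their collection.
            theta = rng.dirichlet(gamma * pi_j)
            # Draw a topic for each word position, then a word from that topic.
            z = rng.choice(n_topics, size=doc_length, p=theta)
            words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
            corpus.append((j, words))

    return topics, np.array(collection_props), corpus
```

Calling `simulate_corpus()` returns the shared topic matrix, one row of topic proportions per collection, and a list of `(collection_index, word_ids)` pairs. Comparing the rows of the second return value shows how collections can weight the same topics differently.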

