VSEC-LDA: Boosting Topic Modeling with Embedded Vocabulary Selection

01/15/2020
by Yuzhen Ding, et al.

Topic modeling has found wide application in problems where the latent structure of the data is crucial for typical inference tasks. When applying a topic model, a relatively standard pre-processing step is to first build a vocabulary of frequent words. This pre-processing step is usually independent of the topic modeling stage, so there is no guarantee that the pre-generated vocabulary can support the inference of an optimal (or even meaningful) topic model appropriate for a given task, especially for computer vision applications involving "visual words". In this paper, we propose a new approach to topic modeling, termed Vocabulary-Selection-Embedded Correspondence-LDA (VSEC-LDA), which learns the latent model while simultaneously selecting the most relevant words. The selection of words is driven by an entropy-based metric that measures the relative contribution of the words to the underlying model, and it is performed dynamically while the model is learned. We present three variants of VSEC-LDA and evaluate the proposed approach with experiments on both synthetic and real databases from different applications. The results demonstrate the effectiveness of built-in vocabulary selection and its importance in improving the performance of topic modeling.
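
To make the idea of entropy-driven word selection concrete, here is a minimal Python sketch. It is not the metric or the algorithm from the paper, whose exact definition the abstract does not give. It assumes one plausible choice: score each word by the entropy of p(topic | word) under a uniform topic prior, then keep the lowest-entropy (most topic-specific) words. The names `word_topic_entropy`, `select_vocabulary`, and `keep_fraction` are hypothetical and introduced only for illustration.

```python
import math


def word_topic_entropy(topic_word):
    """Return the entropy of p(topic | word) for every word.

    Assumption (not from the paper): topics have a uniform prior, so
    p(topic | word) is proportional to p(word | topic).
    topic_word[k][v] holds p(word v | topic k).
    """
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    entropies = []
    for v in range(vocab_size):
        column = [topic_word[k][v] for k in range(num_topics)]
        total = sum(column)
        if total == 0.0:
            # No topic generates this word: treat it as maximally uninformative.
            entropies.append(math.log(num_topics))
            continue
        posterior = [p / total for p in column]
        # Terms with p == 0 contribute nothing (0 * log 0 is taken as 0).
        entropies.append(-sum(p * math.log(p) for p in posterior if p > 0.0))
    return entropies


def select_vocabulary(topic_word, keep_fraction=0.5):
    """Keep the fraction of words with the lowest topic entropy.

    Returns the sorted indices of the retained words. In a dynamic
    scheme this would be re-run between inference sweeps; here it is
    applied once to a fixed topic-word matrix.
    """
    entropies = word_topic_entropy(topic_word)
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # sorted() is stable, so ties are broken by word index.
    ranked = sorted(range(len(entropies)), key=lambda v: entropies[v])
    return sorted(ranked[:n_keep])


# Toy example: words 0 and 1 each belong to a single topic,
# words 2 and 3 are spread evenly across both topics.
toy_topic_word = [
    [0.5, 0.0, 0.25, 0.25],
    [0.0, 0.5, 0.25, 0.25],
]
kept = select_vocabulary(toy_topic_word, keep_fraction=0.5)
```

In this toy matrix the two topic-specific words have zero entropy and are retained, while the two evenly shared words have entropy log 2 and are dropped. A word that is spread evenly across topics says little about which topic generated it, which is the intuition behind pruning by entropy.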

research
05/04/2012

Variable Selection for Latent Dirichlet Allocation

In latent Dirichlet allocation (LDA), topics are multinomial distributio...
research
05/15/2014

Topic words analysis based on LDA model

Social network analysis (SNA), which is a research field describing and ...
research
10/16/2014

Graph-Sparse LDA: A Topic Model with Structured Sparsity

Originally designed to model text, topic modeling has become a powerful ...
research
02/13/2023

Visualizing Topic Uncertainty in Topic Modelling

Word clouds became a standard tool for presenting results of natural lan...
research
09/02/2013

Scalable Probabilistic Entity-Topic Modeling

We present an LDA approach to entity disambiguation. Each topic is assoc...
research
01/12/2017

Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling

Latent Dirichlet Allocation (LDA) models trained without stopword remova...
research
10/26/2014

A provable SVD-based algorithm for learning topics in dominant admixture corpus

Topic models, such as Latent Dirichlet Allocation (LDA), posit that docu...