A provable SVD-based algorithm for learning topics in dominant admixture corpus

10/26/2014
by   Trapit Bansal, et al.

Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from admixtures is NP-hard. Assuming separability, a strong assumption, [4] gave the first provable algorithm for inference. For the LDA model, [6] gave a provable algorithm using tensor methods. But [4,6] do not learn topic vectors with bounded l_1 error (a natural measure for probability vectors). Our aim is to develop a model that makes intuitive and empirically supported assumptions, and to design an algorithm with natural, simple components such as SVD that provably solves the inference problem for the model with bounded l_1 error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic-specific Catchwords: groups of words that occur with strictly greater frequency in one topic than in any other individual topic, and that are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption, which is empirically verified on real corpora, a singular value decomposition (SVD) based algorithm with a crucial pre-processing step of thresholding can provably recover the topics from a collection of documents drawn from Dominant admixtures. Dominant admixtures are convex combinations of distributions in which one distribution has a significantly higher contribution than the others. Apart from the simplicity of the algorithm, the sample complexity has near-optimal dependence on w_0, the lowest probability that a topic is dominant, and improves on that of [4]. Empirical evidence shows that on several real-world corpora both the Catchwords and Dominant admixture assumptions hold, and that the proposed algorithm substantially outperforms the state of the art [5].
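The pipeline described in the abstract (thresholding, then SVD, applied to documents drawn from Dominant admixtures) can be illustrated with a small synthetic sketch. This is not the paper's algorithm: the per-word threshold (the word's mean frequency), the farthest-point clustering step, the toy topics, and every function and parameter name below are assumptions made only for illustration.

```python
import numpy as np

def sample_dominant_corpus(topics, n_docs, doc_len, dominant_weight=0.8, seed=0):
    """Sample word-frequency vectors from dominant admixtures (toy generator)."""
    rng = np.random.default_rng(seed)
    k = topics.shape[0]
    labels = rng.integers(0, k, size=n_docs)  # dominant topic of each document
    # Convex combination: dominant_weight on one topic, the rest split evenly.
    weights = np.full((n_docs, k), (1.0 - dominant_weight) / (k - 1))
    weights[np.arange(n_docs), labels] = dominant_weight
    word_probs = weights @ topics
    word_probs /= word_probs.sum(axis=1, keepdims=True)  # guard against rounding
    counts = np.stack([rng.multinomial(doc_len, p) for p in word_probs])
    return counts / doc_len, labels

def threshold_and_project(freqs, k):
    """Binarize per word, then take the best rank-k approximation via SVD."""
    # Assumption: threshold each word at its mean frequency across documents.
    binary = (freqs > freqs.mean(axis=0, keepdims=True)).astype(float)
    u, s, vt = np.linalg.svd(binary, full_matrices=False)
    projected = (u[:, :k] * s[:k]) @ vt[:k]
    return binary, projected

def cluster(points, k, n_iter=20):
    """Lloyd's iterations with a farthest-point initialization (assumed step)."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):  # keep the old center if a cluster empties
                centers[c] = points[assign == c].mean(axis=0)
    return assign

# Toy topics: 3 topics over 30 words, each with 10 high-probability "catchwords".
topics = np.full((3, 30), 0.01)
for i in range(3):
    topics[i, 10 * i:10 * (i + 1)] = 0.08

freqs, labels = sample_dominant_corpus(topics, n_docs=300, doc_len=200)
binary, projected = threshold_and_project(freqs, k=3)
assign = cluster(projected, k=3)
```

In this toy setting, each cluster in `assign` should mostly collect documents that share a dominant topic. Averaging `freqs` within a cluster would only give a crude estimate of the mixed word distribution, not the pure topic vector that the paper's algorithm is stated to recover.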


