Generalized Topic Modeling

11/04/2016
by Avrim Blum et al.

Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution a⃗_i over words, and a document is generated by first selecting a mixture w⃗ over topics, and then generating words i.i.d. from the associated mixture Aw⃗. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture. In this work we consider a broad generalization of this framework in which words are no longer assumed to be drawn i.i.d. and instead a topic is a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a document classifier. That is, we aim to learn a predictor that given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. We present several natural conditions under which one can do this efficiently and discuss issues such as noise tolerance and sample complexity in this model. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning.
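The standard generative process the abstract contrasts against can be sketched in a few lines: columns of a matrix A are topic vectors a⃗_i (distributions over words), a document draws a mixture w⃗ over topics, and words are then sampled i.i.d. from Aw⃗. The sizes and the Dirichlet priors below are illustrative assumptions for this sketch, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
n_topics, vocab_size, doc_len = 3, 10, 20

# Columns of A are the topic vectors a_i: probability distributions over words.
A = rng.dirichlet(np.ones(vocab_size), size=n_topics).T  # shape (vocab_size, n_topics)

# A document first selects a mixture w over topics...
w = rng.dirichlet(np.ones(n_topics))

# ...then generates words i.i.d. from the mixed word distribution Aw.
word_dist = A @ w
doc = rng.choice(vocab_size, size=doc_len, p=word_dist)
```

The generalization studied in the paper drops exactly the i.i.d. sampling step: instead of independent draws from Aw⃗, a topic induces a complex distribution over sequences of paragraphs, and the goal shifts from recovering A to directly learning a predictor of w⃗.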


Related research

- 11/28/2013: Using Multiple Samples to Learn Mixture Models
- 08/05/2020: BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation
- 08/12/2020: Neural Sinkhorn Topic Model
- 07/12/2021: Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations
- 10/26/2014: A provable SVD-based algorithm for learning topics in dominant admixture corpus
- 03/15/2013: Topic Discovery through Data Dependent and Random Projections
- 10/27/2022: Truncation Sampling as Language Model Desmoothing
