Assigning Topics to Documents by Successive Projections

07/08/2021
by   Olga Klopp, et al.

Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an important task in various areas, such as image analysis, e-commerce, social networks, and population genetics. A common approach to topic modeling is to associate each topic with a probability distribution on the dictionary of words and to consider each document as a mixture of topics. Since the number of topics is typically substantially smaller than the size of the corpus and of the dictionary, topic modeling methods can lead to a dramatic dimension reduction. In this paper, we study the problem of estimating the topic distribution of each document in a given corpus; that is, we focus on the clustering aspect of the problem. We introduce an algorithm that we call Successive Projection Overlapping Clustering (SPOC), inspired by the Successive Projection Algorithm for separable matrix factorization. This algorithm is simple to implement and computationally fast. We establish theoretical guarantees on the performance of the SPOC algorithm, in particular nearly matching minimax upper and lower bounds on its estimation risk. We also propose a new method for estimating the number of topics. We complement our theoretical results with a numerical study on synthetic and semi-synthetic data to analyze the performance of this new algorithm in practice. One of the conclusions is that the error of the algorithm grows at most logarithmically with the size of the dictionary, in contrast to what one observes for Latent Dirichlet Allocation.
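To make the reference to the Successive Projection Algorithm concrete, here is a minimal Python sketch of the classical SPA applied to a noiseless, separable document-word matrix. It is not the authors' SPOC procedure, whose steps are not given in this abstract. The function names `successive_projection` and `mixture_weights`, and the assumption that each topic has at least one "pure" document, are illustrative choices made here.

```python
import numpy as np


def successive_projection(M, k):
    """Classical SPA: greedily pick k rows of M that act as vertices.

    Assumption (not from the abstract): every row of M is a convex
    combination of k unknown "pure" rows that also appear in M.
    """
    R = np.array(M, dtype=float)  # residual matrix, modified step by step
    selected = []
    for _ in range(k):
        sq_norms = np.einsum("ij,ij->i", R, R)  # squared norm of each row
        j = int(np.argmax(sq_norms))            # row with the largest residual
        if sq_norms[j] == 0.0:                  # rank exhausted, stop early
            break
        selected.append(j)
        u = R[j] / np.sqrt(sq_norms[j])         # unit vector along that row
        R = R - np.outer(R @ u, u)              # project all rows onto u's complement
    return selected


def mixture_weights(M, selected):
    """Express every row of M in terms of the selected rows (least squares)."""
    M = np.asarray(M, dtype=float)
    return M @ np.linalg.pinv(M[selected])
```

For a matrix `M = W @ A`, where the rows of `W` are topic proportions and the rows of `A` are topic-word distributions, `successive_projection(M, k)` returns the indices of documents treated as pure, and `mixture_weights` then gives each document's weights on those documents. With noisy word counts the recovery is only approximate, and the weights may need to be projected back onto the simplex.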


