Stability of Topic Modeling via Matrix Factorization

02/23/2017
by Mark Belford, et al.

Topic models can provide insight into the underlying latent structure of a large corpus of documents. A range of methods has been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can lead to different results being generated on the same corpus with the same parameter values. This corresponds to the concept of "instability", which has previously been studied in the context of k-means clustering. In many applications of topic modeling, this problem of instability is not considered, and topic models are treated as definitive even though the results may change considerably if the initialization process is altered. In this paper we demonstrate the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization for topic modeling, we propose the use of ensemble learning strategies. Based on experiments performed on annotated text corpora, we show that a K-Fold ensemble strategy, combining both ensembles and structured initialization, can significantly reduce instability while simultaneously yielding more accurate topic models.
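To make the notion of instability concrete, the following is a minimal sketch of one way agreement between repeated runs could be quantified. It is an illustration only and not necessarily one of the measures proposed in the paper. It assumes each run is summarized as a list of topics, each topic being a list of its top terms. The function names (`jaccard`, `model_agreement`, `average_stability`) and the toy term lists are hypothetical. Topics are matched across runs by brute force over all permutations, which is only practical for a small number of topics.

```python
from itertools import combinations, permutations


def jaccard(terms_a, terms_b):
    """Jaccard similarity between two collections of top terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def model_agreement(model_a, model_b):
    """Best mean Jaccard score over all one-to-one topic matchings.

    Brute force over permutations, so suitable only for small k.
    """
    if len(model_a) != len(model_b) or not model_a:
        raise ValueError("models must be non-empty and have the same number of topics")
    k = len(model_a)
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(model_a[i], model_b[j]) for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def average_stability(runs):
    """Mean pairwise agreement across all runs (1.0 means identical topics)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(model_agreement(a, b) for a, b in pairs) / len(pairs)


# Toy data (assumed): three runs on the same corpus with different random seeds.
run1 = [["goal", "match", "team"], ["vote", "party", "election"]]
run2 = [["party", "vote", "election"], ["team", "goal", "league"]]
run3 = [["goal", "match", "team"], ["vote", "party", "election"]]

print(model_agreement(run1, run2))
print(average_stability([run1, run2, run3]))
```

A score near 1.0 would indicate that the runs recover essentially the same topics regardless of topic order, while lower scores would indicate that the top terms shift between initializations.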


research
07/13/2021

Semiparametric Latent Topic Modeling on Consumer-Generated Corpora

Legacy procedures for topic modelling have generally suffered problems o...
research
08/21/2022

SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection

As the amount of text data continues to grow, topic modeling is serving ...
research
10/18/2021

Uncertainty-aware Topic Modeling Visualization

Topic modeling is a state-of-the-art technique for analyzing text corpor...
research
05/08/2014

Improving Image Clustering using Sparse Text and the Wisdom of the Crowds

We propose a method to improve image clustering using sparse text and th...
research
01/31/2023

Archetypal Analysis++: Rethinking the Initialization Strategy

Archetypal analysis is a matrix factorization method with convexity cons...
research
08/02/2022

No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

Extracting knowledge from unlabeled texts using machine learning algorit...
research
11/01/2016

Robust Spectral Inference for Joint Stochastic Matrix Factorization

Spectral inference provides fast algorithms and provable optimality for ...
