Measuring LDA Topic Stability from Clusters of Replicated Runs

by Mika Mäntylä, et al.

Background: The volume of unstructured textual data is growing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular method for analyzing it. Past work suggests that the instability of LDA topics may lead to systematic errors. Aim: We propose a method that relies on replicated LDA runs and clustering, and that provides a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics by showing the ten most probable words. For the clusters, we try multiple stability metrics that show the stability of the topics inside each cluster; of these, we recommend Rank-Biased Overlap. Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics are related to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions. Our approach makes LDA stability transparent and complements, rather than replaces, many prior works that focus on LDA parameter tuning.
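The pipeline described in the abstract (replicated runs, clustering of the n*k topics, a per-cluster stability score) can be illustrated with a minimal sketch on toy data. Several things here are assumptions rather than details taken from the paper: the topics are hand-written ranked word lists standing in for real LDA output, the clustering distance is taken to be 1 - RBO, the K-medoids routine is a naive alternating variant with a fixed initialization, the RBO variant is the extrapolated form for equal-length lists with p=0.9, and cluster stability is computed as the mean pairwise RBO of the cluster's members. The function names `rbo`, `k_medoids`, and `cluster_stability` are hypothetical.

```python
from itertools import combinations


def rbo(s, t, p=0.9):
    """Extrapolated Rank-Biased Overlap of two non-empty, equal-length
    ranked lists without duplicates. The variant and p are assumptions."""
    k = min(len(s), len(t))
    seen_s, seen_t = set(), set()
    weighted_sum = 0.0
    overlap = 0
    for d in range(1, k + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        overlap = len(seen_s & seen_t)  # number of shared words in the top d
        weighted_sum += (overlap / d) * p ** d
    return (overlap / k) * p ** k + (1 - p) / p * weighted_sum


def k_medoids(dist, k, max_iter=100):
    """Naive alternating K-medoids on a precomputed distance matrix.
    Initialization with the first k points is an assumption of this sketch."""
    n = len(dist)
    medoids = list(range(k))
    labels = [0] * n
    for _ in range(max_iter):
        # Assign every point to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


def cluster_stability(topics, labels, cluster, p=0.9):
    """Mean pairwise RBO inside one cluster (needs at least two members)."""
    members = [t for t, lab in zip(topics, labels) if lab == cluster]
    pairs = list(combinations(members, 2))
    return sum(rbo(a, b, p) for a, b in pairs) / len(pairs)


# Toy stand-in for n=3 replicated runs with k=2 topics each (invented words).
runs = [
    [["fix", "bug", "crash", "test"], ["add", "css", "style", "ui"]],
    [["fix", "bug", "crash", "test"], ["css", "ui", "layout", "add"]],
    [["fix", "bug", "crash", "test"], ["style", "add", "theme", "css"]],
]
topics = [topic for run in runs for topic in run]  # the n*k topics
dist = [[1 - rbo(a, b) for b in topics] for a in topics]
medoids, labels = k_medoids(dist, k=2)

for c in range(2):
    print(c, topics[medoids[c]], round(cluster_stability(topics, labels, c), 3))
```

In this toy input the first topic is identical in every run, so its cluster scores close to 1, while the second topic's word ranking shifts between runs and scores lower. In a real setting the ranked lists would come from the top words of each fitted LDA topic, and the abstract's configuration (k=20, n=20) would yield 400 topics to cluster.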




