Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence

03/30/2023
by Anton Thielmann, et al.

Extracting and identifying latent topics in large text corpora has become increasingly important in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach to topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes and goes beyond simply analyzing word frequencies in the data. This allows our model to detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human-level results on the word-intrusion task. We demonstrate the competitive performance of our method in a large benchmark study and achieve superior results compared to state-of-the-art topic modeling and document clustering models.
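
To make the word-intrusion idea concrete, the following is a minimal sketch of how an intruder word might be picked out automatically using similarity in an embedding space. It is not the metric defined in the paper: the function names `cosine` and `detect_intruder`, the decision rule (lowest mean cosine similarity to the other words), and the toy three-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a pretrained embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def detect_intruder(words, embeddings):
    """Return the word whose mean cosine similarity to the other words is lowest.

    Hypothetical decision rule, chosen for illustration only.
    """
    def mean_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine(embeddings[word], embeddings[o]) for o in others)
        return total / len(others)

    return min(words, key=mean_similarity)


# Toy vectors, invented for this example (not real embeddings).
toy_embeddings = {
    "goal": [0.90, 0.10, 0.00],
    "match": [0.80, 0.20, 0.10],
    "striker": [0.85, 0.15, 0.05],
    "league": [0.70, 0.30, 0.10],
    "inflation": [0.05, 0.10, 0.95],
}

topic_words = list(toy_embeddings)
intruder = detect_intruder(topic_words, toy_embeddings)
print(intruder)  # the word least similar to the rest of the topic
```

With these toy vectors the four football-related words point in nearly the same direction, so "inflation" is the one flagged. Comparing such automatic choices with the intruders that human annotators select is one way a correlation with human judgment could be computed.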

research
10/15/2018

Improving Topic Models with Latent Feature Word Representations

Probabilistic topic models are widely used to discover latent topics in ...
research
03/29/2019

Re-Ranking Words to Improve Interpretability of Automatically Generated Topics

Topic models, such as LDA, are widely used in Natural Language Processi...
research
07/27/2022

CompText: Visualizing, Comparing & Understanding Text Corpus

A common practice in Natural Language Processing (NLP) is to visualize t...
research
06/09/2022

Analyzing Folktales of Different Regions Using Topic Modeling and Clustering

This paper employs two major natural language processing techniques, top...
research
08/13/2016

Analysis of Morphology in Topic Modeling

Topic models make strong assumptions about their data. In particular, di...
research
01/12/2017

Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling

Latent Dirichlet Allocation (LDA) models trained without stopword remova...
research
07/08/2021

Assigning Topics to Documents by Successive Projections

Topic models provide a useful tool to organize and understand the struct...
