Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization

by Daniel Atzberger, et al.

Topic models are a class of unsupervised learning algorithms for detecting the semantic structure within a text corpus. Together with a subsequent dimensionality reduction algorithm, topic models can be used for deriving spatializations for text corpora as two-dimensional scatter plots, reflecting semantic similarity between the documents and supporting corpus analysis. Although the choice of the topic model, the dimensionality reduction, and their underlying hyperparameters significantly impacts the resulting layout, it is unknown which particular combinations result in high-quality layouts with respect to accuracy and perception metrics. To investigate the effectiveness of topic models and dimensionality reduction methods for the spatialization of corpora as two-dimensional scatter plots (or as a basis for landscape-type visualizations), we present a large-scale, benchmark-based computational evaluation. Our evaluation consists of (1) a set of corpora, (2) a set of layout algorithms that are combinations of topic models and dimensionality reductions, and (3) quality metrics for quantifying the quality of the resulting layouts. The corpora are given as document-term matrices, and each document is assigned to a thematic class. The chosen metrics quantify the preservation of local and global properties and the perceptual effectiveness of the two-dimensional scatter plots. By evaluating the benchmark on a computing cluster, we derived a multivariate dataset with over 45 000 individual layouts and corresponding quality metrics. Based on the results, we propose guidelines for the effective design of text spatializations that are based on topic models and dimensionality reductions. As a main result, we show that interpretable topic models are beneficial for capturing the structure of text corpora. We furthermore recommend the use of t-SNE as a subsequent dimensionality reduction.
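To make the idea of a local quality metric concrete, the following is a minimal sketch of one plausible measure: the mean fraction of each document's k nearest neighbors in the 2D layout that share its thematic class. This particular metric, the function name `knn_class_purity`, and the toy layout and class labels are illustrative assumptions; the abstract does not say which metrics the benchmark actually uses.

```python
import math


def knn_class_purity(points, labels, k=3):
    """Mean fraction of each point's k nearest 2D neighbors that share its class.

    Hypothetical stand-in for a local-preservation metric; not taken from the paper.
    """
    n = len(points)
    if n != len(labels):
        raise ValueError("points and labels must have the same length")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    total = 0.0
    for i, (point, label) in enumerate(zip(points, labels)):
        # Rank all other points by Euclidean distance in the 2D layout.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(point, points[j]))
        neighbors = others[:k]
        # Fraction of the k nearest neighbors with the same thematic class.
        total += sum(labels[j] == label for j in neighbors) / k
    return total / n


# Toy layout (assumed data): two well-separated clusters of three documents each.
layout = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
classes = ["sports", "sports", "sports", "politics", "politics", "politics"]
score = knn_class_purity(layout, classes, k=2)

# The same layout with classes that ignore the spatial clusters.
mixed_classes = ["a", "b", "a", "b", "a", "b"]
mixed_score = knn_class_purity(layout, mixed_classes, k=2)
```

A score near 1 means that documents of the same class sit close together in the scatter plot, while a lower score means that classes are interleaved. In a benchmark of the kind described, such a score would be computed once per layout and stored alongside the topic model, dimensionality reduction, and hyperparameters that produced it.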




