Experiments on Generalizability of BERTopic on Multi-Domain Short Text

12/16/2022
by Muriël de Groot, et al.

Topic modeling is widely used for analytically evaluating large collections of textual data. One of the most popular topic modeling techniques is Latent Dirichlet Allocation (LDA), which is flexible and adaptive but not optimal for, e.g., short texts from various domains. We explore how the state-of-the-art BERTopic algorithm performs on short multi-domain text and find that it generalizes better than LDA in terms of topic coherence and diversity. We further analyze the performance of the HDBSCAN clustering algorithm used by BERTopic and find that it classifies a majority of the documents as outliers. This crucial yet overlooked problem excludes too many documents from further analysis. When we replace HDBSCAN with k-Means, we achieve similar performance, but without outliers.
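
As an illustration of two quantities the abstract refers to, the following minimal sketch computes a topic diversity score and the share of documents marked as outliers. It rests on assumptions not stated in the abstract: diversity is taken to be the commonly used "fraction of unique words among the top words of all topics", and outliers are assumed to carry the cluster label -1, as is the convention in HDBSCAN implementations. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of all topics.

    Assumed definition: 1.0 means no word is shared between topics;
    lower values mean the topics overlap more.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


def outlier_fraction(labels, outlier_label=-1):
    """Share of documents whose cluster label marks them as outliers."""
    if not labels:
        return 0.0
    n_outliers = sum(1 for label in labels if label == outlier_label)
    return n_outliers / len(labels)


# Hypothetical toy data: two topics sharing the word "market", and six
# document labels of which three are outliers (-1).
topics = [["price", "market", "stock"], ["game", "team", "market"]]
labels = [0, -1, 1, -1, -1, 0]

diversity = topic_diversity(topics, top_k=3)
outliers = outlier_fraction(labels)
print(f"topic diversity: {diversity:.3f}, outlier fraction: {outliers:.2f}")
```

A clustering step such as k-Means assigns every document to a cluster, so under these assumptions its outlier fraction is zero by construction; the diversity score can then be compared between the two clustering choices.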

research
11/12/2017

Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey

Topic modeling is one of the most powerful techniques in text mining for...
research
01/24/2013

Transfer Topic Modeling with Ease and Scalability

The increasing volume of short texts generated on social media sites, su...
research
03/26/2020

Bag of biterms modeling for short texts

Analyzing texts from social media encounters many challenges due to thei...
research
10/22/2015

Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling

There is an explosion of data, documents, and other content, and people ...
research
04/13/2018

Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections

To make sense of large amounts of textual data, topic modelling is frequ...
research
02/03/2014

A high-reproducibility and high-accuracy method for automated topic classification

Much of human knowledge sits in large databases of unstructured text. Le...
research
05/01/2019

Nested Variational Autoencoder for Topic Modeling on Microtexts with Word Vectors

Most of the information on the Internet is represented in the form of mi...
