MPTopic: Improving topic modeling via Masked Permuted pre-training

09/02/2023
by   Xinche Zhang, et al.

Topic modeling uncovers the hidden semantic structure of a text collection and produces meaningful descriptive keywords for it. Recent techniques such as BERTopic and Top2Vec have come to the forefront, but they have limitations. Our analysis indicates that these methods may not prioritize refining their clustering mechanism, which can compromise the quality of the resulting topic clusters. For example, Top2Vec uses the centroids of its clustering results to represent topics, whereas BERTopic applies C-TF-IDF for topic extraction. In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assessing the relevance of terms within a document. Building on TF-RDF, we present MPTopic, a clustering algorithm driven by TF-RDF scores. A comprehensive evaluation shows that the topic keywords identified by MPTopic combined with TF-RDF outperform those extracted by both BERTopic and Top2Vec.
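For readers unfamiliar with the BERTopic baseline mentioned above, the following is a minimal sketch of class-based TF-IDF (C-TF-IDF) as it is commonly described: each cluster is treated as one concatenated document, and a term's normalized frequency in the cluster is weighted by log(1 + A / f_t), where A is the average number of words per cluster and f_t is the term's frequency across all clusters. The function names, the toy data, and the whitespace-free pre-tokenized input are assumptions made for illustration. This is not the TF-RDF measure, which the abstract does not define.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster with class-based TF-IDF.

    clusters: dict mapping a cluster id to a list of tokenized documents.
    Returns a dict mapping cluster id to {term: score}.
    """
    # Term counts per cluster, treating each cluster as one long document.
    tf = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per cluster (A in the formula).
    avg_words = sum(total.values()) / len(tf)

    scores = {}
    for cid, counts in tf.items():
        n_words = sum(counts.values())
        scores[cid] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_keywords(scores, k=3):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cid: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cid, s in scores.items()
    }


if __name__ == "__main__":
    # Hypothetical toy clusters, already tokenized.
    toy = {
        "sports": [["goal", "match", "team"], ["team", "goal", "score"]],
        "cooking": [["recipe", "oven", "team"], ["recipe", "bake", "oven"]],
    }
    print(top_keywords(c_tf_idf(toy), k=2))
```

In this scheme the clustering step and the keyword scoring are separate: the term weights are computed only after the clusters are fixed, which is the kind of decoupling the abstract points to when it says the clustering mechanism itself is not refined.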

