ClusTop: An unsupervised and integrated text clustering and topic extraction framework

01/03/2023
by Zhongtao Chen, et al.

Text clustering and topic extraction are two important tasks in text mining. Usually, these two tasks are performed separately. For topic extraction to facilitate clustering, we can first project texts into a topic space and then apply a clustering algorithm to obtain clusters. To promote topic extraction by clustering, we can first obtain clusters with a clustering algorithm and then extract cluster-specific topics. However, this naive strategy ignores the fact that text clustering and topic extraction are strongly correlated and follow a chicken-and-egg relationship. Performing them separately prevents them from benefiting each other and from achieving the best overall performance. In this paper, we propose an unsupervised text clustering and topic extraction framework (ClusTop), which integrates text clustering and topic extraction into a unified framework and can simultaneously achieve high-quality clustering results and extract topics from each cluster. Our framework includes four components: enhanced language model training, dimensionality reduction, clustering, and topic extraction, where the enhanced language model can be viewed as a bridge between clustering and topic extraction. On the one hand, it provides text embeddings with a strong cluster structure, which facilitates effective text clustering; on the other hand, it pays close attention to topic-related words for topic extraction because of its self-attention architecture. Moreover, the training of the enhanced language model is unsupervised. Experiments on two datasets demonstrate the effectiveness of our framework and provide benchmarks for different model combinations in this framework.
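The sequence of stages named in the abstract (embed, reduce dimensionality, cluster, extract per-cluster topics) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and not the authors' method: normalised bag-of-words vectors stand in for the enhanced language model embeddings, variance-based feature selection stands in for dimensionality reduction, a small k-means stands in for the clustering component, and a cluster-level frequency score stands in for attention-based topic extraction. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def embed(docs):
    """Stand-in for the language model: L2-normalised bag-of-words vectors."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        vectors.append([counts[w] / norm for w in vocab])
    return tokenized, vectors

def reduce_dims(vectors, n_dims):
    """Stand-in for dimensionality reduction: keep the highest-variance dimensions."""
    n = len(vectors)
    def variance(j):
        col = [v[j] for v in vectors]
        mean = sum(col) / n
        return sum((x - mean) ** 2 for x in col) / n
    keep = sorted(range(len(vectors[0])), key=variance, reverse=True)[:n_dims]
    return [[v[j] for j in keep] for v in vectors]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def extract_topics(tokenized, labels, k, top_n=2):
    """Rank words per cluster by in-cluster count, discounting words shared across clusters."""
    per_cluster = [Counter() for _ in range(k)]
    for toks, label in zip(tokenized, labels):
        per_cluster[label].update(toks)
    clusters_with = Counter(w for counts in per_cluster for w in counts)
    topics = []
    for counts in per_cluster:
        score = {w: c * math.log(k / clusters_with[w]) for w, c in counts.items()}
        topics.append(sorted(score, key=lambda w: (-score[w], w))[:top_n])
    return topics

# Hypothetical toy corpus: three finance-like and three sports-like texts.
docs = [
    "stock market shares fall",
    "market investors buy shares",
    "shares rise as market rallies",
    "team wins football match",
    "football team loses match",
    "match ends as team scores",
]
tokenized, vectors = embed(docs)
labels = kmeans(reduce_dims(vectors, n_dims=4), k=2)
topics = extract_topics(tokenized, labels, k=2)
print(labels, topics)
```

Note that this sketch runs the stages strictly one after another, which is the separate, sequential strategy the abstract argues against. In the framework described, the enhanced language model is what couples clustering and topic extraction, and that coupling is not reproduced here.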

research
09/26/2013

Integrating Document Clustering and Topic Modeling

Document clustering and topic modeling are two closely related tasks whi...
research
12/19/2022

Very Large Language Model as a Unified Methodology of Text Mining

Text data mining is the process of deriving essential information from l...
research
04/16/2021

Tracing Topic Transitions with Temporal Graph Clusters

Twitter serves as a data source for many Natural Language Processing (NL...
research
03/09/2019

Mutual Clustering on Comparative Texts via Heterogeneous Information Networks

Currently, many intelligence systems contain the texts from multi-source...
research
12/16/2021

UniREx: A Unified Learning Framework for Language Model Rationale Extraction

An extractive rationale explains a language model's (LM's) prediction on...
research
09/02/2023

MPTopic: Improving topic modeling via Masked Permuted pre-training

Topic modeling is pivotal in discerning hidden semantic structures withi...
research
04/20/2023

CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering

Text clustering, as one of the most fundamental challenges in unsupervis...
