Enhancement of Short Text Clustering by Iterative Classification

01/31/2020
by Md Rashadul Hasan Rakib, et al.

Short text clustering is a challenging task because short texts carry little signal. In this work, we propose iterative classification as a method to boost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. It then trains a classification algorithm on the non-outliers, using their cluster assignments as labels. Using the trained classification model, iterative classification reclassifies the outliers to obtain a new set of clusters. Repeating these steps several times yields a much improved clustering of the texts. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin.
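The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the outlier detector or the classifier, so the sketch substitutes a distance-to-centroid rule for outlier removal and a nearest-centroid rule for classification. The function names and the `iterations` and `keep_fraction` parameters are hypothetical, and texts are assumed to be already embedded as numeric vectors.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def select_inliers(X, labels, keep_fraction):
    """Assumed outlier rule: per cluster, keep the points closest to the
    cluster centroid and treat the rest as outliers."""
    keep = set()
    for c in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        center = centroid([X[i] for i in members])
        members.sort(key=lambda i: math.dist(X[i], center))
        # Always keep at least one point so no cluster disappears.
        n_keep = max(1, int(round(keep_fraction * len(members))))
        keep.update(members[:n_keep])
    return keep


def iterative_classification(X, labels, iterations=5, keep_fraction=0.8):
    """Refine an initial clustering by repeatedly removing outliers,
    training on the remaining points, and reclassifying the outliers."""
    labels = list(labels)  # work on a copy of the initial clustering
    for _ in range(iterations):
        inliers = select_inliers(X, labels, keep_fraction)
        # "Train" the stand-in classifier: one centroid per cluster,
        # computed from the non-outliers only.
        centers = {
            c: centroid([X[i] for i in inliers if labels[i] == c])
            for c in set(labels)
        }
        # Reclassify each outlier to the nearest trained centroid.
        for i in range(len(X)):
            if i not in inliers:
                labels[i] = min(centers, key=lambda c: math.dist(X[i], centers[c]))
    return labels
```

Any clustering algorithm can supply the initial `labels`; the sketch only refines them. Swapping in a stronger outlier detector or a trained text classifier changes `select_inliers` and the `centers` step but leaves the loop itself unchanged.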
