Representation Learning for Short Text Clustering

09/21/2021
by   Hui Yin, et al.

Effective representation learning is critical for short text clustering due to the sparse, high-dimensional, and noisy nature of short text corpora. Existing pre-trained models (e.g., Word2vec and BERT) have greatly improved the expressiveness of short text representations, yielding more condensed, low-dimensional, and continuous features than the traditional Bag-of-Words (BoW) model. However, these models are trained for general purposes and are thus suboptimal for the short text clustering task. In this paper, we propose two methods that exploit the unsupervised autoencoder (AE) framework to further tune short text representations built on these pre-trained text models for optimal clustering performance. In our first method, Structural Text Network Graph Autoencoder (STN-GAE), we exploit the structural information among texts in the corpus by constructing a text network, and then adopt a graph convolutional network as the encoder to fuse the structural features with the pre-trained text features for representation learning. In our second method, Soft Cluster Assignment Autoencoder (SCA-AE), we impose an additional soft cluster assignment constraint on the latent space of the autoencoder to encourage the learned text representations to be more clustering-friendly. We evaluated both methods on seven popular short text datasets. The experimental results show that when only a pre-trained model is used for short text clustering, BERT performs better than BoW and Word2vec. However, once the pre-trained representations are further tuned, proposed methods such as SCA-AE can greatly improve clustering performance, with an accuracy gain of up to 14% compared to using BERT alone.
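The soft cluster assignment constraint in SCA-AE can be illustrated with the widely used DEC-style formulation (Student's t soft assignment plus a KL-divergence loss against a sharpened target distribution). This is a minimal NumPy sketch of that idea, not the paper's exact implementation; the function names and the choice of alpha = 1.0 are illustrative assumptions.

```python
import numpy as np

def soft_assignment(z, centers, alpha=1.0):
    """Soft cluster assignment Q: similarity between latent points z
    (n x d) and cluster centers (k x d) under a Student's t kernel,
    normalized so each row is a probability distribution."""
    dist_sq = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary distribution P that up-weights
    high-confidence assignments, pulling points toward their
    likely cluster centers."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(q, p):
    """KL(P || Q); in training this term is added to the
    autoencoder's reconstruction loss."""
    return float((p * np.log(p / q)).sum())
```

During training, Q is recomputed from the current latent codes, P is derived from Q, and minimizing KL(P || Q) alongside the reconstruction loss nudges the latent space toward a clustering-friendly structure.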


