Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

11/29/2018
by   Tiehang Duan, et al.

Current state-of-the-art nonparametric Bayesian text clustering methods model documents through a multinomial distribution over bags of words. Although these methods effectively capture word burstiness and achieve decent performance, they exploit neither the sequential information in text nor the relationships among synonyms. In this paper, documents are modeled jointly through bags of words, sequential features, and word embeddings. We propose the Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to exploit this joint document representation in text clustering. The sequential features are extracted by an encoder-decoder component, and word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major aspects: 1) improved performance, measured by normalized mutual information (NMI), across multiple diverse text datasets; 2) more accurate inference of the ground-truth number of clusters, with a regularization effect on tiny outlier clusters.
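
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how NMI between a ground-truth labeling and a predicted clustering can be computed. It is not taken from the paper: the function name `normalized_mutual_information` is hypothetical, and normalizing by the arithmetic mean of the two entropies is an assumption, since the abstract does not say which NMI variant is used.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI between two labelings of the same documents.

    Assumption: mutual information is normalized by the arithmetic mean
    of the two label entropies; other normalizations exist.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    # Mutual information: sum over co-occurring (true, predicted) label pairs.
    mi = 0.0
    for (t, p), n_tp in count_joint.items():
        mi += (n_tp / n) * math.log(n_tp * n / (count_true[t] * count_pred[p]))

    def entropy(counts):
        # Shannon entropy of the empirical label distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    denom = 0.5 * (entropy(count_true) + entropy(count_pred))
    if denom == 0.0:
        # Both labelings put every document in a single cluster.
        return 1.0
    return mi / denom


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(normalized_mutual_information(truth, [1, 1, 0, 0]))
    print(normalized_mutual_information(truth, [0, 1, 0, 1]))
```

Under this definition, two clusterings that agree up to a renaming of cluster labels score 1, and a clustering that is statistically independent of the ground truth scores 0, which is why the metric suits comparing models that may infer different numbers of clusters.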

research
07/25/2017

From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings

In this paper, we propose a novel approach for text classification based...
research
01/20/2016

Hierarchical Latent Word Clustering

This paper presents a new Bayesian non-parametric model by extending the...
research
02/23/2019

Vector of Locally-Aggregated Word Embeddings (VLAWE): A novel document-level embedding

In this paper, we propose a novel representation for text documents base...
research
04/18/2020

Effect of Text Color on Word Embeddings

In natural scenes and documents, we can find the correlation between a t...
research
06/03/2019

Contextually Propagated Term Weights for Document Representation

Word embeddings predict a word from its neighbours by learning small, de...
research
09/19/2017

MetaLDA: a Topic Model that Efficiently Incorporates Meta information

Besides the text content, documents and their associated words usually c...
research
08/24/2021

Hybrid Multisource Feature Fusion for the Text Clustering

The text clustering technique is an unsupervised text mining method whic...
