Mutual Clustering on Comparative Texts via Heterogeneous Information Networks

03/09/2019
by Jianping Cao, et al.
Many intelligence systems now contain texts from multiple sources, e.g., bulletin board system (BBS) posts, tweets, and news. These texts can be "comparative": they may be semantically correlated and thus offer different perspectives on the same topics or events. To better organize multi-sourced texts and obtain more comprehensive knowledge, we propose to study the novel problem of Mutual Clustering on Comparative Texts (MCCT), which aims to cluster the comparative texts simultaneously and collaboratively. The MCCT problem is difficult to address because 1) comparative texts usually come in different data formats and structures and are therefore hard to organize, and 2) no effective method exists for connecting semantically correlated comparative texts so that they can be clustered in a unified way. To this end, we propose HINT, a Heterogeneous Information Network-based Text clustering framework. HINT first models multi-sourced texts (e.g., news and tweets) as heterogeneous information networks by introducing shared "anchor texts" that connect the comparative texts. Next, two similarity matrices based on HINT, together with a transition matrix for cross-text-source knowledge transfer, are constructed. Clustering of the comparative texts is then conducted using the constructed matrices. Finally, a mutual clustering algorithm is proposed to further unify the separate clustering results of the comparative texts by introducing a clustering consistency constraint. We conduct extensive experiments on three tweets-news datasets, and the results demonstrate the effectiveness and robustness of the proposed method in addressing the MCCT problem.
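The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' HINT implementation: the abstract does not specify how anchor texts, the similarity matrices, or the transition matrix are defined, so everything below is an assumption made for illustration. Here, anchor texts are assumed to be shared items (for example, hashtags or named entities) that documents link to, similarity is assumed to be cosine similarity over those links, and the consistency step is reduced to transferring a given news clustering onto tweets. The names `cosine_similarity`, `transition_matrix`, `news_links`, and `tweet_links` are hypothetical.

```python
import numpy as np


def cosine_similarity(links):
    """Document-document similarity over the path doc -> anchor -> doc.

    `links` is a (documents x anchors) matrix; this definition is an
    illustrative assumption, not taken from the paper.
    """
    norms = np.linalg.norm(links, axis=1, keepdims=True)
    unit = links / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def transition_matrix(src_links, dst_links):
    """Row-normalized co-linkage between two text sources.

    Entry (i, j) is the share of source doc i's anchor overlap that
    falls on target doc j. Rows with no overlap are left as zeros.
    """
    co_links = src_links @ dst_links.T
    row_sums = co_links.sum(axis=1, keepdims=True)
    return co_links / np.where(row_sums == 0, 1.0, row_sums)


# Toy data (made up): rows are documents, columns are 4 shared anchor texts.
news_links = np.array([[1, 1, 0, 0],
                       [1, 0, 0, 0],
                       [0, 0, 1, 1]], dtype=float)
tweet_links = np.array([[1, 1, 0, 0],
                        [0, 0, 1, 0],
                        [0, 0, 0, 1],
                        [0, 1, 0, 0]], dtype=float)

# Two within-source similarity matrices and one cross-source transition matrix.
sim_news = cosine_similarity(news_links)
sim_tweets = cosine_similarity(tweet_links)
news_to_tweets = transition_matrix(news_links, tweet_links)

# Assume the news documents have already been clustered into 2 groups.
news_labels = np.array([0, 0, 1])
news_indicator = np.eye(2)[news_labels]          # one-hot, shape (3, 2)

# Transfer cluster evidence from news to tweets through the transition matrix.
tweet_scores = news_to_tweets.T @ news_indicator  # shape (4, 2)
tweet_labels = tweet_scores.argmax(axis=1)
```

In this toy setup each tweet inherits the cluster of the news documents it shares anchors with. A full mutual clustering method would instead optimize both clusterings jointly under a consistency constraint, which this sketch does not attempt.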
