Research on Multilingual News Clustering Based on Cross-Language Word Embeddings

05/30/2023
by Lin Wu, et al.

Grouping reports of the same event published in different countries is of significant importance for public opinion control and intelligence gathering. Because news comes in many forms, relying solely on human translators would be costly and inefficient, while relying solely on machine translation systems would incur considerable overhead in invoking translation interfaces and storing translated texts. To address this issue, we focus on the clustering of cross-lingual news. Specifically, we represent a news article as the combination of a sentence-vector representation of its headline in a mixed semantic space and the topic probability distribution of its content. To train the cross-lingual model, we employ knowledge distillation to fit two monolingual semantic spaces into a single mixed semantic space. We abandon traditional static clustering methods such as K-Means and AGNES in favor of the incremental clustering algorithm Single-Pass, which we further modify to better suit cross-lingual news clustering. Our main contributions are as follows: (1) We adopt the standard English BERT as the teacher model and XLM-RoBERTa as the student model, training through knowledge distillation a cross-lingual model that can represent sentence-level bilingual texts in both Chinese and English. (2) Using the LDA topic model, we represent a news article as the combination of a cross-lingual headline vector and a topic probability distribution over its content, introducing concepts such as topic similarity to address the cross-lingual issue in content representation. (3) We adapt the Single-Pass clustering algorithm to the news setting. Our optimizations include adjusting the distance computation between samples and clusters, adding a cluster-merging operation, and incorporating a news-time parameter.
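
As a concrete illustration of contribution (1), the following is a minimal sketch of the distillation setup in plain PyTorch with Hugging Face transformers: the English teacher defines the target vectors, and the multilingual student is trained to map both an English sentence and its Chinese translation onto the teacher's vector. The checkpoint names, the mean-pooling choice, and the toy parallel pair are assumptions for illustration, not the authors' exact code.

```python
# Sketch of cross-lingual sentence-embedding distillation (assumed setup):
# the English teacher produces target vectors, and the multilingual student
# is trained so that both the English sentence and its Chinese translation
# land on the teacher's vector (MSE loss), fitting the two monolingual
# spaces into one mixed semantic space.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

teacher_name = "bert-base-uncased"      # assumed English teacher checkpoint
student_name = "xlm-roberta-base"       # multilingual student

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
student_tok = AutoTokenizer.from_pretrained(student_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()
student = AutoModel.from_pretrained(student_name)

def embed(model, tok, sentences):
    """Mean-pool the last hidden states into one vector per sentence."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    return (out * mask).sum(1) / mask.sum(1)        # (B, H)

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
mse = nn.MSELoss()

# toy parallel batch: (English sentence, Chinese translation)
pairs = [("Floods hit the southern provinces", "洪水袭击南部省份")]
en = [p[0] for p in pairs]
zh = [p[1] for p in pairs]

with torch.no_grad():
    target = embed(teacher, teacher_tok, en)        # teacher defines the space

# the student must place BOTH languages at the teacher's position
loss = mse(embed(student, student_tok, en), target) + \
       mse(embed(student, student_tok, zh), target)
loss.backward()
optimizer.step()
```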

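For contribution (2), here is a hedged sketch of how a cross-lingual headline vector and an LDA topic distribution over the body text might be blended into a single news-to-news similarity, using gensim for the LDA model. The weight alpha and the Hellinger-based topic similarity are assumed stand-ins for the paper's topic-similarity definition, not its exact formulation.

```python
# Sketch of the combined news representation: a cross-lingual headline
# vector plus an LDA topic probability distribution over the body text.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

# toy tokenized bodies (in practice: segmented Chinese / tokenized English)
docs = [["flood", "rescue", "rain"], ["election", "vote", "party"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

def topic_vector(tokens, num_topics=2):
    """Dense topic probability distribution for one document."""
    bow = dictionary.doc2bow(tokens)
    dist = np.zeros(num_topics)
    for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[t] = p
    return dist

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_similarity(p, q):
    """1 minus the Hellinger distance between two topic distributions."""
    return 1.0 - np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

def news_similarity(head_a, head_b, body_a, body_b, alpha=0.6):
    """Weighted blend of headline-vector and topic-distribution similarity;
    alpha is an assumed tuning weight."""
    return alpha * cosine(head_a, head_b) + \
           (1 - alpha) * topic_similarity(topic_vector(body_a),
                                          topic_vector(body_b))
```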
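
For contribution (3), the sketch below shows a Single-Pass loop carrying the three stated modifications: a sample-to-cluster distance computed over all cluster members rather than a single representative, a cluster-merging step, and a time parameter that damps similarity to stale clusters. The specific choices (average member similarity, exponential time decay, centroid-based merge threshold) are assumptions that illustrate the idea, not the paper's exact formulas.

```python
# Sketch of the modified Single-Pass clustering loop (assumed formulas).
import numpy as np

class Cluster:
    def __init__(self, vec, ts):
        self.vectors, self.times = [vec], [ts]

    @property
    def centroid(self):
        return np.mean(self.vectors, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sim_to_cluster(vec, ts, cluster, tau=3.0):
    """Average similarity to all members, damped by the publication-time
    gap (in days); tau controls how fast old clusters stop attracting news."""
    text_sim = np.mean([cosine(vec, v) for v in cluster.vectors])
    gap = abs(ts - np.mean(cluster.times))
    return text_sim * np.exp(-gap / tau)

def single_pass(items, theta=0.7, merge_theta=0.85):
    clusters = []
    for vec, ts in items:                      # items arrive in time order
        if clusters:
            sims = [sim_to_cluster(vec, ts, c) for c in clusters]
            best = int(np.argmax(sims))
            if sims[best] >= theta:            # join the closest cluster
                clusters[best].vectors.append(vec)
                clusters[best].times.append(ts)
                # merge step: fold in any cluster whose centroid drifted close
                merged = [c for i, c in enumerate(clusters) if i != best and
                          cosine(c.centroid, clusters[best].centroid) >= merge_theta]
                for c in merged:
                    clusters[best].vectors += c.vectors
                    clusters[best].times += c.times
                    clusters.remove(c)
                continue
        clusters.append(Cluster(vec, ts))      # otherwise open a new cluster
    return clusters
```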

