Simplifying Multilingual News Clustering Through Projection From a Shared Space

04/28/2022
by João Santos, et al.

Organizing and clustering multilingual news articles is essential for media monitoring, because it allows news stories to be followed in real time. Most approaches to this task focus on high-resource languages (mostly English) and disregard low-resource languages. With that in mind, we present a much simpler online system that clusters an incoming stream of documents without depending on language-specific features. We empirically demonstrate that using multilingual contextual embeddings as the document representation significantly improves clustering quality. We challenge previous crosslingual approaches by removing the precondition of building monolingual clusters. We model the clustering process as a set of linear classifiers that aggregate similar documents, and we correct closely related multilingual clusters by merging them in an online fashion. Our system achieves state-of-the-art results on a multilingual news stream clustering dataset, and we introduce a new evaluation for zero-shot news clustering in multiple languages. We make our code available as open source.
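To make the pipeline described in the abstract more concrete, here is a minimal sketch of online clustering over precomputed document embeddings. It is not the authors' implementation: a fixed cosine-similarity threshold stands in for the learned linear classifiers, and the class name `OnlineClusterer`, its methods, and both threshold values are hypothetical. The embeddings are assumed to come from some multilingual encoder, which is why no language-specific step appears.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class OnlineClusterer:
    """Hypothetical sketch: threshold-based stand-in for learned classifiers."""

    def __init__(self, assign_threshold=0.8, merge_threshold=0.9):
        # Both thresholds are illustrative assumptions, not values from the paper.
        self.assign_threshold = assign_threshold
        self.merge_threshold = merge_threshold
        self.clusters = []  # each cluster: {"centroid": [...], "docs": [...]}

    def add(self, doc_id, embedding):
        """Assign one incoming document, then merge clusters that became close."""
        best, best_sim = None, -1.0
        for cluster in self.clusters:
            sim = cosine(embedding, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= self.assign_threshold:
            self._absorb(best, [doc_id], embedding, 1)
        else:
            best = {"centroid": list(embedding), "docs": [doc_id]}
            self.clusters.append(best)
        self._merge_close(best)
        return best

    def _absorb(self, cluster, doc_ids, vector, count):
        """Fold `count` documents with mean `vector` into the cluster centroid."""
        n = len(cluster["docs"])
        cluster["centroid"] = [
            (c * n + v * count) / (n + count)
            for c, v in zip(cluster["centroid"], vector)
        ]
        cluster["docs"].extend(doc_ids)

    def _merge_close(self, target):
        """Merge into `target` any other cluster whose centroid is close enough."""
        for other in list(self.clusters):
            if other is target:
                continue
            sim = cosine(target["centroid"], other["centroid"])
            if sim >= self.merge_threshold:
                self._absorb(target, other["docs"], other["centroid"],
                             len(other["docs"]))
                # Remove by identity so equal-looking clusters are not confused.
                self.clusters = [c for c in self.clusters if c is not other]
```

Calling `add(doc_id, embedding)` once per incoming article keeps the state in `clusters`; because documents in all languages share one embedding space, a story reported in several languages can land in the same cluster without a separate monolingual stage.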

research
09/03/2018

Multilingual Clustering of Streaming News

Clustering news across languages enables efficient media monitoring by a...
research
01/26/2021

Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings

We propose a method for online news stream clustering that is a variant ...
research
02/04/2022

Twitter Referral Behaviours on News Consumption with Ensemble Clustering of Click-Stream Data in Turkish Media

Click-stream data, which comes with a massive volume generated by the hu...
research
04/17/2020

Batch Clustering for Multilingual News Streaming

Nowadays, digital news articles are widely available, published by vario...
research
07/12/2020

Xiaomingbot: A Multilingual Robot News Reporter

This paper proposes the building of Xiaomingbot, an intelligent, multili...
research
11/21/2022

Extended Multilingual Protest News Detection – Shared Task 1, CASE 2021 and 2022

We report results of the CASE 2022 Shared Task 1 on Multilingual Protest...
research
05/22/2023

Automated stance detection in complex topics and small languages: the challenging case of immigration in polarizing news media

Automated stance detection and related machine learning methods can prov...
