Multilingual Stance Detection: The Catalonia Independence Corpus

03/31/2020
by Elena Zotova, et al.

Stance detection aims to determine the attitude of a given text with respect to a specific topic or claim. While stance detection has been fairly well researched in recent years, most of the work has focused on English. This is mainly due to the relative lack of annotated data in other languages. The TW-10 Referendum Dataset released at IberEval 2018 is a previous effort to provide multilingual stance-annotated data in Catalan and Spanish. Unfortunately, the TW-10 Catalan subset is extremely imbalanced. This paper addresses these issues by presenting a new multilingual dataset for stance detection on Twitter for the Catalan and Spanish languages, with the aim of facilitating research on stance detection in multilingual and cross-lingual settings. The dataset is annotated with stance towards one topic, namely, the independence of Catalonia. We also provide a semi-automatic method to annotate the dataset based on a categorization of Twitter users. We experiment on the new corpus with a number of supervised approaches, including linear classifiers and deep learning methods. Comparison of our new corpus with the TW-10 dataset shows both the benefits and potential of a well-balanced corpus for multilingual and cross-lingual research on stance detection. Finally, we establish new state-of-the-art results on the TW-10 dataset, both for Catalan and Spanish.
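
The abstract does not spell out how the user-based annotation works, so the following is only a minimal sketch of one plausible reading: each tweet inherits the stance assigned to its author, and tweets by uncategorized users are left out. The label names, the `USER_STANCE` mapping, and the helper functions are hypothetical illustrations, not the authors' published procedure.

```python
from collections import Counter

# Hypothetical user-level categorization (an assumption for illustration);
# in practice it would come from manually reviewing each account.
USER_STANCE = {
    "user_a": "FAVOR",
    "user_b": "AGAINST",
    "user_c": "NEUTRAL",
}


def label_tweets(tweets, user_stance):
    """Give each tweet its author's stance; drop tweets by uncategorized users."""
    labeled = []
    for tweet in tweets:
        stance = user_stance.get(tweet["user"])
        if stance is not None:
            labeled.append({**tweet, "stance": stance})
    return labeled


def class_distribution(labeled):
    """Count tweets per stance label, e.g. to check how balanced the corpus is."""
    return Counter(t["stance"] for t in labeled)
```

Because labels are propagated from users rather than assigned per tweet, a manual review of a sample would still be needed to catch tweets whose content does not match the author's general position. Inspecting the output of `class_distribution` is one simple way to check the class balance that the abstract emphasizes.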


