Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter

01/28/2021
by Elena Zotova, et al.

Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English. Although some efforts have recently been made to develop annotated data in other languages, there is a telling lack of resources to facilitate multilingual and cross-lingual research on stance detection. This is partially because manually annotating a corpus of social media texts is a difficult, slow and costly process. Furthermore, as stance is a highly domain- and topic-specific phenomenon, the need for annotated data is especially pressing. As a result, most of the manually labeled resources are hindered by their relatively small size and skewed class distribution. This paper presents a method to obtain multilingual datasets for stance detection in Twitter. Instead of manually annotating on a per-tweet basis, we leverage user-based information to semi-automatically label large amounts of tweets. Empirical monolingual and cross-lingual experimentation and qualitative analysis show that our method helps to overcome the aforementioned difficulties in building large, balanced and multilingual labeled corpora. We believe that our method can be easily adapted to generate labeled social media data for other Natural Language Processing tasks and domains.
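
The abstract does not spell out how user-based information is turned into tweet labels, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a stance has already been assigned to each user by some other means; the function name `label_tweets_by_user`, the `user` and `text` fields, and the `FAVOR`/`AGAINST` label names are all hypothetical. Each tweet simply inherits the stance of its author, and tweets from users without a known stance are left out.

```python
from collections import Counter


def label_tweets_by_user(tweets, user_stance):
    """Give each tweet the stance of its author (hypothetical helper).

    tweets: list of dicts with "user" and "text" keys (assumed schema).
    user_stance: dict mapping a user name to a stance label (assumed to
    have been obtained beforehand; how is outside this sketch).
    """
    labeled = []
    for tweet in tweets:
        stance = user_stance.get(tweet["user"])
        if stance is None:
            continue  # author has no known stance, so the tweet stays unlabeled
        labeled.append({**tweet, "stance": stance})
    return labeled


# Toy data, invented purely for illustration.
tweets = [
    {"user": "alice", "text": "first example tweet"},
    {"user": "bob", "text": "second example tweet"},
    {"user": "carol", "text": "third example tweet"},
    {"user": "alice", "text": "fourth example tweet"},
]
user_stance = {"alice": "FAVOR", "bob": "AGAINST"}

labeled = label_tweets_by_user(tweets, user_stance)
class_counts = Counter(t["stance"] for t in labeled)
print(class_counts)
```

Counting labels per class, as in the last lines, is one simple way to check how balanced the resulting corpus is before using it for training.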

