BERTuit: Understanding Spanish language in Twitter through a native transformer

04/07/2022
by Javier Huertas-Tato, et al.

The appearance of complex attention-based language models such as BERT, RoBERTa or GPT-3 has made it possible to address highly complex tasks in a plethora of scenarios. However, when applied to specific domains, these models encounter considerable difficulties. This is the case of social networks such as Twitter, an ever-changing stream of information written in informal and complex language, where each message requires careful evaluation to be understood even by humans, given the important role that context plays. Addressing tasks in this domain through Natural Language Processing involves severe challenges. When powerful state-of-the-art multilingual language models are applied to this scenario, language-specific nuances tend to get lost in translation. To face these challenges we present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets using RoBERTa optimization. Our motivation is to provide a powerful resource for better understanding Spanish Twitter, to be used in applications focused on this social network, with special emphasis on solutions devoted to tackling the spread of misinformation on this platform. BERTuit is evaluated on several tasks and compared against M-BERT, XLM-RoBERTa and XLM-T, very competitive multilingual transformers. The utility of our approach is demonstrated with two applications: a zero-shot methodology to visualize groups of hoaxes, and profiling of authors who spread disinformation. Misinformation spreads wildly on platforms such as Twitter in languages other than English, meaning the performance of transformers may suffer when transferred outside English-speaking communities.
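
As a rough illustration (not code from the paper), the sketch below shows how a BERTuit-style checkpoint could support the zero-shot application mentioned above: tweets are embedded with a RoBERTa-style encoder and grouped into candidate hoax clusters without any fine-tuning. The checkpoint identifier, the example tweets, and the choice of mean pooling plus k-means are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a Hugging Face-compatible BERTuit checkpoint exists;
# the identifier below is a placeholder, not the real model name.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

MODEL_ID = "path/to/bertuit-checkpoint"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

tweets = [
    "Ejemplo de tuit sobre un bulo que circula en redes.",
    "Otro mensaje informal con hashtags #ejemplo y menciones @usuario.",
    "Un tercer tuit que repite el mismo bulo con otras palabras.",
]

with torch.no_grad():
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding positions
    embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean-pooled tweet vectors

# Group the embedded tweets into candidate hoax clusters with no task-specific training.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings.numpy())
print(labels)
```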


research
09/15/2022

TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations

We present TwHIN-BERT, a multilingual language model trained on in-domai...
research
02/18/2020

From English To Foreign Languages: Transferring Pre-trained Language Models

Pre-trained models have demonstrated their effectiveness in many downstr...
research
10/07/2020

Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation

Tweets are specific text data when compared to general text. Although se...
research
11/13/2021

SocialBERT – Transformers for Online Social Network Language Modelling

The ubiquity of the contemporary language understanding tasks gives rele...
research
08/30/2021

On the Multilingual Capabilities of Very Large-Scale English Language Models

Generative Pre-trained Transformers (GPTs) have recently been scaled to ...
research
05/31/2021

An Exploratory Analysis of Multilingual Word-Level Quality Estimation with Cross-Lingual Transformers

Most studies on word-level Quality Estimation (QE) of machine translatio...
research
04/18/2022

Exploring Dimensionality Reduction Techniques in Multilingual Transformers

Both in scientific literature and in industry, semantic and context-awa...
