Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction

10/20/2021
by   Shubhanshu Mishra, et al.

We evaluate a simple approach to improving the zero-shot multilingual transfer of mBERT on social media corpora: adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts is a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs; we fine-tune a model on source-language task data and evaluate it in the target language. In particular, we focus on language pairs where transfer learning is difficult for mBERT: those where the source and target languages differ in script, vocabulary, and linguistic typology. We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese on two social media tasks, NER (a 37% average relative improvement in F1 across target languages) and sentiment classification (a 12% relative improvement in F1), while also benchmarking on a non-social media task, Universal Dependency POS tagging (a 6.7% relative improvement in accuracy). Our results are promising given the lack of a social media bitext corpus. Our code can be found at: https://github.com/twitter-research/multilingual-alignment-tpp.
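To make the TPP objective concrete, the sketch below shows one way labeled training pairs could be built from a bitext: each true translation pair becomes a positive example, and mismatched pairs become negatives. This is an illustration under stated assumptions, not the authors' implementation. The function name `make_tpp_examples`, the uniform random negative sampling, and the toy sentences are all hypothetical; the abstract does not say how negatives are chosen.

```python
import random
from typing import List, Tuple

# (source text, target text, label); label 1 = valid translation, 0 = not.
TppExample = Tuple[str, str, int]


def make_tpp_examples(
    bitext: List[Tuple[str, str]],
    num_negatives: int = 1,
    seed: int = 0,
) -> List[TppExample]:
    """Build binary TPP examples from aligned (source, target) pairs.

    Assumption: negatives are drawn uniformly at random from the other
    target-side texts. The paper may use a different strategy.
    """
    rng = random.Random(seed)
    targets = [tgt for _, tgt in bitext]
    examples: List[TppExample] = []
    for i, (src, tgt) in enumerate(bitext):
        # The aligned pair is a positive example.
        examples.append((src, tgt, 1))
        # Any other, non-identical target text can serve as a negative.
        candidates = [t for j, t in enumerate(targets) if j != i and t != tgt]
        k = min(num_negatives, len(candidates))
        for neg in rng.sample(candidates, k):
            examples.append((src, neg, 0))
    return examples


# Toy English-Hindi bitext, for illustration only.
toy_bitext = [
    ("good morning", "सुप्रभात"),
    ("thank you", "धन्यवाद"),
    ("see you tomorrow", "कल मिलते हैं"),
]
tpp_examples = make_tpp_examples(toy_bitext, num_negatives=1)
```

In a setup like the one the abstract describes, each (source, target) pair would presumably be encoded as a two-segment input to mBERT and scored by a binary classification head. That step is omitted here because it depends on details the abstract does not give.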


