Pretrained contextual representation models trained with language modeling Peters et al. (2018); Yang et al. (2019) or cloze task objectives Devlin et al. (2019); Liu et al. (2019) have quickly set a new standard for NLP tasks. These models have also been trained in multilingual settings. As the authors of BERT say, “[…] (they) do not plan to release more single-language models”; instead, they train a single BERT model on Wikipedia to serve 104 languages, without any explicit cross-lingual links, yielding multilingual BERT (mBERT) Devlin (2018). Surprisingly, mBERT learns high-quality cross-lingual representations and shows strong zero-shot cross-lingual transfer performance Wu and Dredze (2019); Pires et al. (2019). However, evaluations have focused on high resource languages, with cross-lingual transfer using English as a source language, or on within-language performance. Wu and Dredze (2019) evaluated mBERT on 39 languages, which leaves the majority of mBERT’s 104 languages, most of which are low resource, untested.
Does mBERT learn equally high-quality representations for all of its 104 languages? If not, which languages are hurt by its massively multilingual pretraining? It has been observed that for high resource languages like English, mBERT performs worse than a monolingual BERT of the same capacity Devlin (2018). But for low resource languages (in terms of monolingual corpus size), it remains unclear how mBERT compares to a monolingual BERT, and whether multilingual joint training helps mBERT learn better representations for them.
We evaluate the representation quality of mBERT on 99 languages for NER, and on 54 for part-of-speech tagging and dependency parsing. We show that mBERT does not learn equally high-quality representations for all of the 104 languages: the bottom 30% of languages perform much worse than a non-BERT model on NER. Additionally, by training monolingual BERT models for low resource languages with the same data size, we show that the low representation quality of low resource languages is not the result of BERT’s hyperparameters or of sharing the model with a large number of languages, as monolingual BERT performs even worse than mBERT. On the contrary, by pairing low resource languages with linguistically related languages, we show that low resource languages benefit from multilingual joint training: bilingual BERT outperforms monolingual BERT while still lagging behind mBERT.
Our findings suggest that with a small monolingual corpus, BERT does not learn high-quality representations for low resource languages. To learn better representations for low resource languages, we suggest either collecting more data to make low resource languages high resource Conneau et al. (2019), or adopting more data-efficient pretraining techniques like ELECTRA Clark et al. (2020). We leave exploring more data-efficient pretraining techniques as future work.
2 Related Work
Multilingual Contextual Representations
Deep contextualized representation models such as ELMo Peters et al. (2018) and BERT Devlin et al. (2019) have set a new standard for NLP systems. Their application to multilingual settings, pretraining one model on text from multiple languages with a single vocabulary, has driven forward work in cross-language learning and transfer Wu and Dredze (2019); Pires et al. (2019); Mulcaire et al. (2019). BERT-based pretraining also benefits language generation tasks like machine translation Conneau and Lample (2019). BERT can be further improved with explicit cross-language signals, including bitext Conneau and Lample (2019); Huang et al. (2019) and word translation pairs, either from a dictionary Wu et al. (2019) or induced from a bitext Ji et al. (2019).
Several factors need to be considered in understanding mBERT. First, the 104 most common Wikipedia languages vary considerably in size (Table 1). Therefore, mBERT training attempted to equalize languages by up-sampling words from low resource languages and down-sampling words from high resource languages. Previous work has found that shared strings across languages provide sufficient signal for inducing cross-lingual word representations Lample et al. (2018); Artetxe et al. (2017). While Wu and Dredze (2019) find that the number of shared subwords across languages correlates with cross-lingual performance, multilingual BERT can still learn cross-lingual representations without any vocabulary overlap across languages Wu et al. (2019); K et al. (2020). Additionally, Wu et al. (2019) find that bilingual BERT can still achieve decent cross-lingual transfer by sharing only the transformer layers across languages, and Artetxe et al. (2019) show that learning the embedding layer alone, while using a fixed transformer encoder from English monolingual BERT, can also produce decent cross-lingual transfer performance. Second, while each language may be similarly represented in the training data, subwords are not evenly distributed among the languages. Many languages share common characters and cognates, biasing subword learning toward some languages over others. Both of these factors may influence how well mBERT learns representations for low resource languages.
Finally, Baevski et al. (2019) show that, in general, larger pretraining data for English leads to better downstream performance, yet an exponential increase in pretraining data yields only a linear increase in downstream performance. For a low resource language with limited pretraining data, it is unclear whether contextual representations outperform previous methods.
Representations for Low Resource Languages
Embeddings with subword information, a non-contextual representation, like fastText Bojanowski et al. (2017) and BPEmb Heinzerling and Strube (2018), are more data-efficient than contextual representations like ELMo and BERT when a limited amount of text is available. For low resource languages, both monolingual corpora and task-specific supervision are usually limited. When task-specific supervision is limited, e.g. for sequence labeling in low resource languages, mBERT performs better than fastText while underperforming a single BPEmb trained on all languages Heinzerling and Strube (2019). In contrast to that work, we focus on mBERT from the perspective of representation learning for each language in terms of monolingual corpus resources, and analyze how to improve BERT for low resource languages. We also consider parsing in addition to sequence labeling tasks.
Concurrently, Conneau et al. (2019) train a multilingual masked language model Devlin et al. (2019) on 2.5TB of filtered CommonCrawl data covering 100 languages and show it outperforms a Wikipedia-based model on low resource languages (Urdu and Swahili) for XNLI Conneau et al. (2018). Using CommonCrawl greatly increases monolingual resources, especially for low resource languages, turning languages that are low resource by Wikipedia size into high resource ones. For example, Mongolian has 6 million tokens in Wikipedia but 248 million in CommonCrawl. Indeed, this 40-fold data increase for Mongolian (mn) raises its WikiSize, a measure of monolingual corpus size introduced in § 3.1, from 5 to roughly 10, as shown in Tab. 1, making it relatively high resource with respect to mBERT.
| WikiSize | Languages | # Languages | Size Range (GB) |
| --- | --- | --- | --- |
| 3 | io, pms, scn, yo | 4 | [0.006, 0.011] |
| 4 | cv, lmo, mg, min, su, vo | 6 | [0.011, 0.022] |
| 5 | an, bar, br, ce, fy, ga, gu, is, jv, ky, lb, mn, my, nds, ne, pa, pnb, sw, tg | 19 | [0.022, 0.044] |
| 6 | af, ba, cy, kn, la, mr, oc, sco, sq, tl, tt, uz | 12 | [0.044, 0.088] |
| 7 | az, bn, bs, eu, hi, ka, kk, lt, lv, mk, ml, nn, ta, te, ur | 15 | [0.088, 0.177] |
| 8 | ast, be, bg, da, el, et, gl, hr, hy, ms, sh, sk, sl, th, war | 15 | [0.177, 0.354] |
| 9 | fa, fi, he, id, ko, no, ro, sr, tr, vi | 10 | [0.354, 0.707] |
| 10 | ar, ca, cs, hu, nl, sv, uk | 7 | [0.707, 1.414] |
| 11 | ceb, it, ja, pl, pt, zh | 6 | [1.414, 2.828] |
| 12 | de, es, fr, ru | 4 | [2.828, 5.657] |
3 Experimental Setup
We begin by defining high and low resource languages in mBERT, then describe the models and downstream tasks we use for evaluation, followed by a description of the masked language model pretraining.
3.1 High/Low Resource Languages
Since mBERT was trained on articles from Wikipedia, a language is considered high or low resource for mBERT based on the size of its Wikipedia. Size can be measured in many ways (articles, tokens, characters); we use the size of the raw dump archive file;111The size of English (en) is the size of this file: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for convenience we use the log (base 2) of the size in MB (WikiSize). English is the highest resource language (15.5GB) and Yoruba the lowest (10MB).222The ordering does not necessarily match the number of speakers for a language. Tab. 1 shows languages and their relative resources.
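As a sanity check, the bucket boundaries in Tab. 1 are consistent with rounding the log (base 2) of the archive size in MB; a minimal sketch (the function name is ours):

```python
import math

def wiki_size(archive_bytes: int) -> int:
    """WikiSize bucket: rounded log (base 2) of the dump size in MB."""
    return round(math.log2(archive_bytes / 2**20))

# Yoruba's ~10 MB dump lands in the lowest bucket of Tab. 1:
print(wiki_size(10 * 2**20))   # -> 3
# A ~4 GB dump (the de/es/fr/ru range) lands in bucket 12:
print(wiki_size(4 * 2**30))    # -> 12
```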
3.2 Downstream Tasks
mBERT supports 104 languages, and we seek to evaluate the learned representations for as many of these as possible. We consider three NLP tasks for which annotated task data exists in a large number of languages: named entity recognition (NER), universal part-of-speech (POS) tagging and universal dependency parsing. For each task, we train a task-specific model using within-language supervised data on top of the mBERT representation with fine-tuning.
For NER we use data created by Pan et al. (2017), automatically built from Wikipedia, which covers 99 of the 104 languages supported by mBERT. We evaluate NER with entity-level F1. This data is in-domain, as mBERT is pretrained on Wikipedia. For POS tagging and dependency parsing, we use Universal Dependencies (UD) v2.3 Nivre et al. (2018), which covers 54 languages (101 treebanks) supported by mBERT. We evaluate POS with accuracy (ACC) and parsing with labeled attachment score (LAS) and unlabeled attachment score (UAS). For POS, we consider UPOS within the treebank. For parsing, we only consider universal dependency labels. The domain is treebank-specific, so we use all treebanks of a language for completeness.
For the sequence labeling tasks (NER and POS), we add a linear function with a softmax on top of mBERT. For NER, at test time, we adopt a simple post-processing heuristic as a structured decoder to obtain valid named entity spans: we rewrite a stand-alone prediction of I-X to B-X, and an inconsistent prediction of B-X I-Y to B-Y I-Y, so the final label determines the entity type. For dependency parsing, we replace the LSTM in the graph-based parser of Dozat and Manning (2017) with mBERT, keeping the parser’s original hyperparameters. Note we do not use universal part-of-speech tags as input for dependency parsing. We fine-tune all parameters of mBERT for a specific task. We use a maximum sequence length of 128 for the sequence labeling tasks; for sentences longer than 128 subwords, we use a sliding window with the previous 64 tokens as context. For dependency parsing, we also use sequence length 128 due to memory constraints and drop sentences with more than 128 subwords; we apply the same treatment to the baseline Che et al. (2018) to obtain comparable results. Since mBERT operates on the subword level, we select the first subword of each word for the task-specific layer with masking.
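The BIO repair heuristic can be sketched as follows; this is a minimal reading of the two rewrite rules described above, not the actual implementation (the function name and the handling of longer inconsistent chains are our assumptions):

```python
def repair_bio(labels):
    """Post-process NER predictions into valid BIO spans:
    a stand-alone I-X becomes B-X, and B-X followed by I-Y
    becomes B-Y I-Y (the final label wins)."""
    fixed = list(labels)
    for i, tag in enumerate(fixed):
        if tag.startswith("I-"):
            prev = fixed[i - 1] if i > 0 else "O"
            if prev == "O":
                # stand-alone I-X -> B-X
                fixed[i] = "B-" + tag[2:]
            elif prev[2:] != tag[2:]:
                # inconsistent B-X I-Y -> B-Y I-Y
                fixed[i - 1] = "B-" + tag[2:]
    return fixed
```

For example, `["O", "I-PER", "O"]` becomes `["O", "B-PER", "O"]`, and `["B-ORG", "I-LOC"]` becomes `["B-LOC", "I-LOC"]`.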
We train all models with Adam Kingma and Ba (2014). We warm up the learning rate linearly over the first 10% of steps, then decrease it linearly to 0. We select hyperparameters by grid search on dev set performance, as recommended by Devlin et al. (2019); the grid covers the learning rate (2e-5, 3e-5, and 5e-5) and the batch size (16 and 32). As task-specific supervision size differs by language or treebank, we fine-tune each model for 10k gradient steps, evaluating every 200 steps, and select the best model and hyperparameters for a language or treebank on the corresponding dev set.
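The fine-tuning schedule above (linear warmup over the first 10% of steps, then linear decay to zero) can be written as a step-to-learning-rate function; the function and argument names are ours:

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.1):
    """Linear warmup over the first warmup_frac of steps,
    then linear decay to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# With 10k total steps and peak 3e-5, the peak is reached at step 1000:
print(linear_warmup_decay(1000, 10000, 3e-5))   # -> 3e-05
```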
We compare our mBERT models with previously published methods: Pan et al. (2017) for NER; for POS and dependency parsing, the best performing system ranked by LAS in the 2018 universal parsing shared task Che et al. (2018),333The shared task uses UD v2.2 while we use v2.3. However, treebanks contain minor changes from version to version. which uses ELMo as well as word embeddings. Additionally, Che et al. (2018) train on POS and dependency parsing jointly, while we train mBERT to perform each task separately; as a result, dependency parsing with mBERT does not have access to POS tags. By comparing mBERT to these baselines, we control for task- and language-specific supervised training set size.
3.3 Masked Language Model Pretraining
We include several experiments in which we pretrain BERT from scratch. We use the PyTorch Paszke et al. (2019) implementation of Conneau and Lample (2019).444https://github.com/facebookresearch/XLM All sentences in the corpus are concatenated. For each language, we sample batches of fixed-length token sequences, ignoring sentence boundaries. When considering two languages, we sample each language uniformly. We then randomly select 15% of the input tokens for masking, proportionally to the token count exponentiated to the power -0.5, favoring rare tokens. We replace a selected token with <MASK> 80% of the time, keep the original token 10% of the time, and substitute a uniformly random token from the vocabulary 10% of the time. The model is trained to recover the original token Devlin et al. (2019). We drop the next sentence prediction task, as Liu et al. (2019) find it does not improve downstream performance.
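A minimal sketch of this masking procedure, assuming precomputed token counts and a vocabulary list (the names and the exact sampling routine are ours; the real implementation operates on subword IDs over concatenated corpora):

```python
import random

def select_and_mask(tokens, counts, vocab, mask_ratio=0.15, power=-0.5):
    """Select ~15% of positions for the cloze objective, weighting each
    position by count**-0.5 so rare tokens are masked more often; then
    replace with <MASK> 80% of the time, a random token 10%, and keep
    the original 10%. Returns the corrupted sequence and the targets."""
    weights = [counts[t] ** power for t in tokens]
    k = max(1, int(len(tokens) * mask_ratio))
    # sampling with replacement, so slightly fewer than k unique positions
    positions = set(random.choices(range(len(tokens)), weights=weights, k=k))
    masked, targets = list(tokens), {}
    for i in positions:
        targets[i] = tokens[i]
        r = random.random()
        if r < 0.8:
            masked[i] = "<MASK>"
        elif r < 0.9:
            masked[i] = random.choice(vocab)
        # else: keep the original token
    return masked, targets
```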
We extract text from a Wikipedia dump with Gensim Řehůřek and Sojka (2010). We learn a vocabulary for the corpus using SentencePiece Kudo and Richardson (2018) with the unigram language model Kudo (2018). When considering two languages, we concatenate the corpora of the two languages and sample the same number of sentences from both when learning the vocabulary. We learn a vocabulary of fixed size (see § 5.2), excluding special tokens. Finally, we tokenize the corpora with the learned SentencePiece model and apply no further preprocessing.
Following mBERT, we use 12 Transformer layers Vaswani et al. (2017) with 12 heads, an embedding dimension of 768, a feed-forward hidden dimension of 3072, dropout of 0.1, and GELU activations Hendrycks and Gimpel (2016). We tie the output softmax layer and input embeddings Press and Wolf (2017). We consider both a 12-layer model (base) and a smaller 6-layer model (small).
We train BERT with Adam and an inverse square root learning rate schedule with warmup Vaswani et al. (2017): we warm up linearly for 10k steps to a peak learning rate of 0.0001. We use mixed-precision training. We train each model for roughly 115k steps, saving a checkpoint every 23k steps (10 epochs), and select the best of the five checkpoints with a task-specific dev set. Each model trains on a single NVIDIA RTX Titan with 24GB of memory for roughly 20 hours.
4 Are All Languages Created Equal in mBERT?
Fig. 1 shows the performance of mBERT and the baseline averaged across all languages by Wikipedia size (see Tab. 1 for groupings). For WikiSize over 6, mBERT is comparable to or better than the baselines on all three tasks, with one exception: for NER on very high resource languages (WikiSize over 11, i.e. the top 10%), mBERT performs worse than the baseline, suggesting that high resource languages could benefit from monolingual pretraining. Note that mBERT has strong UAS on parsing but weak LAS compared to the baseline; Wu and Dredze (2019) find that adding POS to mBERT improves LAS significantly, so we expect multitask learning on POS and parsing could further improve LAS. While POS and parsing only cover half (54) of the languages, NER covers 99 of the 104 languages, extending the curve to the lowest resource languages. mBERT performance drops significantly for languages with WikiSize less than 6 (the bottom 30% of languages). For the smallest sizes, mBERT goes from being competitive with the state of the art to being over 10 points behind. Readers may find this surprising: while these are very low resource languages, mBERT training up-weighted them to counter exactly this effect.
Fig. 2 shows the performance of mBERT (only) on NER across languages with different resources, along with how much task-specific supervised training data was available for each language. For languages with only 100 labeled sentences, the performance of mBERT drops significantly, as these languages also had less pretraining data. We might expect pretrained representations to be most beneficial precisely for languages with only 100 labels, since Howard and Ruder (2018) show pretraining improves data-efficiency for English text classification; instead, our results show that on low resource languages this strategy performs much worse than a model trained directly on the available task data. Clearly, mBERT provides representations of variable quality depending on the language. While we confirm the finding of others that mBERT is excellent for high resource languages, it is much worse for low resource languages. Our results suggest caution for those expecting a reliable model for all 104 mBERT languages.
5 Why Are All Languages Not Created Equal in mBERT?
5.1 Statistical Analysis
We present a statistical analysis to understand why mBERT does so poorly on some languages. We consider three factors that might affect downstream task performance: pretraining Wikipedia size (WikiSize), task-specific supervision size, and vocabulary size in the task-specific data. Note we take the log of training size and training vocab, following WikiSize. We consider NER because it covers nearly all languages of mBERT.
| Factor | Slope | p | 95% CI |
| --- | --- | --- | --- |
| Training Size | 0.035 | 0.001 | [0.029, 0.041] |
| Training Vocab | 0.021 | 0.001 | [0.017, 0.025] |
| Training Size | 0.029 | 0.001 | [0.023, 0.035] |
We fit a linear model to predict task performance (F1) from a single factor. Tab. 2 shows that each factor has a statistically significant positive correlation. A unit increase in training size leads to the biggest performance increase, followed by training vocabulary and then WikiSize, all in log scale. Intuitively, training size and training vocab correlate with each other. We confirm this with a log-likelihood ratio test: adding training vocabulary to a linear model that already contains training size yields a statistically insignificant improvement. As a result, when considering multiple factors, we use training size and WikiSize. Interestingly, Tab. 2 shows training size still has a positive but slightly smaller slope, while the slope of WikiSize changes sign, suggesting WikiSize correlates with training size. We confirm this by fitting a linear model predicting training size from WikiSize; the slope is over 0.5 and statistically significant. This finding is unsurprising, as the NER dataset is built from Wikipedia, so a larger Wikipedia means a larger training set.
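Each single-factor slope in Tab. 2 comes from an ordinary least-squares fit of F1 against one log-scale factor; a minimal closed-form sketch (function name ours, significance testing omitted):

```python
def ols_slope(xs, ys):
    """Slope and intercept of a single-factor least-squares fit,
    e.g. predicting NER F1 from one log-scale resource factor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

# A perfectly linear toy relation recovers slope 2, intercept 0:
print(ols_slope([1, 2, 3], [2, 4, 6]))  # -> (2.0, 0.0)
```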
In conclusion, the larger the task-specific supervised dataset, the better the downstream performance on NER. Unsurprisingly, while pretraining improves data-efficiency Howard and Ruder (2018), it still cannot solve a task with limited supervision. Training vocabulary and Wikipedia size correlate with training size, and increasing any one factor leads to better performance. A similar conclusion holds when we instead predict the performance ratio of mBERT to the baseline. Statistical analysis shows a correlation between resources and mBERT performance, but it cannot give a causal answer to why low resource languages within mBERT perform poorly.
5.2 mBERT vs monolingual BERT
We have established that mBERT does not perform well on low resource languages. Is this because we are relying on a multilingual model that favors high resource over low resource languages? To answer this question, we train monolingual BERT models on several low resource languages with different hyperparameters. Since pretraining a BERT model from scratch is computationally intensive, we select four low resource languages: Latvian (lv), Afrikaans (af), Mongolian (mn), and Yoruba (yo). These four languages (bold in Tab. 3) reflect varying amounts of monolingual training data.
| | lv | af | mn | yo |
| --- | --- | --- | --- | --- |
| # Sentences (M) | 2.9 | 2.3 | 0.8 | 0.1 |
| # Tokens (M) | 21.8 | 28.8 | 6.4 | 0.9 |
| mBERT vocab (K) | 56.6 | 59.0 | 42.3 | 29.3 |
| mBERT vocab (%) | 49.2 | 51.3 | 36.8 | 25.5 |
It turns out that these low resource languages are reasonably covered by mBERT’s vocabulary: 25% to 50% of the subword types within the mBERT 115K vocabulary appear in these languages’ Wikipedia. However, the mBERT vocabulary is by no means optimal for these languages. Fig. 3 shows that a large share of the mBERT vocabulary items appearing in these languages are low frequency, while the language-specific SentencePiece vocabulary has much higher frequencies. In other words, the vocabulary of mBERT is not distributed uniformly across languages.
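The coverage percentages in Tab. 3 amount to a simple set intersection between the mBERT subword vocabulary and the subwords observed in a language's tokenized Wikipedia; a hypothetical helper:

```python
def vocab_coverage(mbert_vocab, corpus_tokens):
    """Count and percentage of mBERT subword types that appear
    at least once in a language's tokenized corpus."""
    seen = set(mbert_vocab) & set(corpus_tokens)
    return len(seen), 100.0 * len(seen) / len(set(mbert_vocab))

# Toy example: 2 of 4 vocabulary items appear in the corpus.
print(vocab_coverage({"a", "b", "c", "d"}, ["a", "a", "c"]))  # -> (2, 50.0)
```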
(Table 4: NER, POS, and parsing (LAS/UAS) results for lv and af, and NER results for mn and yo, comparing monolingual BERT (§ 5.2) across model size, vocabulary, and maximum sequence length settings, and bilingual BERT (§ 5.3) trained on lv + lt and af + nl.)
To train the monolingual BERTs properly for low resource languages, we consider four different sets of hyperparameters. In base, we follow English monolingual BERT with a 30K vocabulary and 12 transformer layers. To ensure a reasonable batch size on our GPU, we cap the training sequence length. Since a smaller model can prevent overfitting on smaller datasets, we also consider 6 transformer layers (small). We do not decrease the batch size, as larger batches are observed to improve performance Liu et al. (2019). As low resource languages have small corpora, 30K vocabulary items might not be optimal, so we also consider a smaller vocabulary (smaller vocab). Finally, since fine-tuning uses a maximum sequence length of only 128, in smaller sequence length we match the fine-tuning phase with a length of 128; with half the self-attention range, we can increase the batch size over 2.5 times.
Tab. 4 shows the performance of monolingual BERT in the four settings. The smaller sequence length model performs best, outperforming the base model in 5 out of 8 task-language combinations. The smaller vocabulary model has mixed performance on the lowest resource languages (mn, yo) but falls short for the (relatively) higher resource languages (lv, af). Finally, the smaller model underperforms the base model in 5 out of 8 cases. In conclusion, the best way to pretrain BERT with a limited computation budget for low resource languages is to use a smaller sequence length, allowing a larger batch size. Future work could look into smaller self-attention spans with a restricted transformer Vaswani et al. (2017) to improve training efficiency.
Despite these insights, no monolingual BERT outperforms mBERT (except on Latvian POS). For the higher resource languages (lv, af), we hypothesize that training longer with a larger batch size could further improve downstream performance, as the cloze task dev perplexity was still improving; Fig. 4 supports this hypothesis, showing that downstream dev performance on lv and af improves as pretraining continues. Yet for the lower resource languages (mn, yo), the cloze task dev perplexity plateaus and the models begin to overfit the training set, and Fig. 4 shows the downstream performance of mn fluctuates. This suggests the cloze task dev perplexity correlates with downstream performance only while the dev perplexity is still decreasing.
The fact that monolingual BERT underperforms mBERT on four low resource languages suggests that mBERT style multilingual training benefits low resource languages by transferring from other languages; monolingual training produces worse representations due to small corpus size. Additionally, the poor performance of mBERT on low resource languages does not emerge from balancing between languages. Instead, it appears that we do not have sufficient data, or the model is not sufficiently data-efficient.
5.3 mBERT vs Bilingual BERT
Finally, we consider a middle ground between monolingual training and massively multilingual training: we train a BERT model on a low resource language (lv and af) paired with a related higher resource language. We pair Lithuanian (lt) with Latvian and Dutch (nl) with Afrikaans.555We did not consider mn and yo since neither has a closely related language in mBERT. Lithuanian’s corpus is similar in size to Latvian’s, while Dutch’s is over 10 times bigger than Afrikaans’. Lithuanian belongs to the same genus as Latvian, while Afrikaans is a daughter language of Dutch. The base pair model has the same hyperparameters as the base model.
Tab. 4 shows that pairing a low resource language with a closely related language improves downstream performance. Afrikaans-Dutch BERT improves more than Latvian-Lithuanian, possibly because Dutch is much larger relative to Afrikaans than Lithuanian is to Latvian. These experiments suggest that pairing linguistically related languages benefits representation learning, and that adding extra languages can further improve performance, as demonstrated by mBERT. This echoes the finding of Conneau and Lample (2019) that multilingual training improves unidirectional language model perplexity for low resource languages. Concurrent work shows similar findings: the performance of low resource languages (Urdu and Swahili) on XNLI improves as more languages are trained jointly, then decreases as the number of languages keeps growing Conneau et al. (2019). However, they do not consider the effect of language similarity.
While mBERT covers 104 languages, the 30% of languages with the least pretraining resources perform worse than using no pretrained language model at all. Therefore, we caution against using mBERT alone for low resource languages. Furthermore, training a monolingual model on a low resource language does no better. Training on a pair of closely related languages helps, but still lags behind mBERT. At the other end of the spectrum, the highest resource languages (top 10%) are hurt by massively multilingual joint training: while mBERT has access to numerous languages, the resulting model is worse than a monolingual model when sufficient training data exists.
Developing pretrained language models for low-resource languages remains an open challenge. Future work should consider more efficient pretraining techniques, how to obtain more data for low resource languages, and how to best make use of multilingual corpora.
This research is supported in part by ODNI, IARPA, via the BETTER Program contract #2019-19051600005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
- Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.
- Artetxe et al. (2019) Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856.
- Baevski et al. (2019) Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5359–5368, Hong Kong, China. Association for Computational Linguistics.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Che et al. (2018) Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Brussels, Belgium. Association for Computational Linguistics.
- Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
- Conneau et al. (2019) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- Conneau and Lample (2019) Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067.
- Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
- Devlin (2018) Jacob Devlin. 2018. Multilingual BERT README document.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Dozat and Manning (2017) Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations.
- Heinzerling and Strube (2018) Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Heinzerling and Strube (2019) Benjamin Heinzerling and Michael Strube. 2019. Sequence tagging with contextual and non-contextual subword representations: A multilingual evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 273–291, Florence, Italy. Association for Computational Linguistics.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Huang et al. (2019) Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2485–2494, Hong Kong, China. Association for Computational Linguistics.
- Ji et al. (2019) Baijun Ji, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, and Weihua Luo. 2019. Cross-lingual pre-training based transfer for zero-shot neural machine translation. arXiv preprint arXiv:1912.01214.
- K et al. (2020) Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kudo (2018) Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
- Lample et al. (2018) Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Mulcaire et al. (2019) Phoebe Mulcaire, Jungo Kasai, and Noah A. Smith. 2019. Polyglot contextual representations improve crosslingual transfer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3912–3918, Minneapolis, Minnesota. Association for Computational Linguistics.
- Nivre et al. (2018) Joakim Nivre, Manying Zhang, and Hanzhi Zhu. 2018. Universal dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
- Pan et al. (2017) Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Pires et al. (2019) Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
- Press and Wolf (2017) Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.
- Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Wu et al. (2019) Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.
- Wu and Dredze (2019) Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.