Are Pretrained Multilingual Models Equally Fair Across Languages?

10/11/2022
by Laura Cabello Piqueras et al.

Pretrained multilingual language models can help bridge the digital language divide, enabling high-quality NLP models for lower-resourced languages. Studies of multilingual models have so far focused on performance, consistency, and cross-lingual generalisation. However, with their widespread application in the wild and downstream societal impact, it is important to put multilingual models under the same scrutiny as monolingual models. This work investigates the group fairness of multilingual models, asking whether these models are equally fair across languages. To this end, we create a new four-way multilingual dataset of parallel cloze test examples (MozArt), equipped with demographic information (balanced with regard to gender and native tongue) about the test participants. We evaluate three multilingual models on MozArt – mBERT, XLM-R, and mT5 – and show that across the four target languages, the three models exhibit different levels of group disparity, e.g., exhibiting near-equal risk for Spanish, but high levels of disparity for German.
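To make the evaluation setup concrete, the sketch below shows one way such a cloze-based fairness check could look: a multilingual masked language model fills the blank in parallel cloze sentences, and accuracy is then aggregated per demographic group. The model choice, example records, field names, and the max-minus-min disparity summary are illustrative assumptions, not the paper's released code, data, or exact fairness metric (mT5, being a text-to-text model, would require a different prompting scheme).

# Minimal illustrative sketch, assuming a MozArt-style record layout; not the
# paper's released code or exact metric. A multilingual masked LM fills the
# cloze slot, and accuracy is compared across demographic groups.
from collections import defaultdict

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT; XLM-R would be analogous
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

# Hypothetical records: a cloze sentence with one masked slot, the gold filler,
# and demographic attributes of the test participant (gender, native tongue).
examples = [
    {"text": "Der Hund [MASK] im Garten.", "gold": "spielt",
     "gender": "female", "native_tongue": "de"},
    {"text": "El perro [MASK] en el jardín.", "gold": "juega",
     "gender": "male", "native_tongue": "es"},
]

def cloze_correct(example):
    """True if the model's top prediction for the masked slot matches the gold word."""
    enc = tokenizer(example["text"], return_tensors="pt")
    mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    with torch.no_grad():
        logits = model(**enc).logits
    pred_id = logits[0, mask_pos].argmax(dim=-1)
    return tokenizer.decode(pred_id).strip() == example["gold"]

# Per-group accuracy and a simple disparity summary (largest accuracy gap).
hits = defaultdict(list)
for ex in examples:
    hits[(ex["gender"], ex["native_tongue"])].append(cloze_correct(ex))

accuracy = {group: sum(v) / len(v) for group, v in hits.items()}
disparity = max(accuracy.values()) - min(accuracy.values())
print(accuracy, disparity)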

