How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

12/31/2020
by Phillip Rust, et al.

In this work we provide a systematic empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first establish whether a gap exists between the multilingual and the corresponding monolingual representation of a language, and subsequently investigate the reason for any performance difference. To disentangle the impacting variables, we train new monolingual models on the same data but with different tokenizers: the monolingual and the multilingual version. We find that, while pretraining data size is an important factor, the designated tokenizer of the monolingual model plays an equally important role in downstream performance. Our results show that languages that are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases relative to their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
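The comparison above hinges on how well a tokenizer segments text in a given language. A minimal sketch of one way to quantify this is shown below: computing subword "fertility" (average number of subword tokens per whitespace-delimited word) for a multilingual versus a monolingual tokenizer. The Hugging Face model identifiers and the example sentence are illustrative assumptions, not taken from the paper; any comparable pair of checkpoints would work the same way.

```python
# Sketch: compare subword fertility (subwords produced per word) of a
# multilingual vs. a monolingual tokenizer. Model identifiers below are
# illustrative assumptions; substitute any multilingual/monolingual pair.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """Average number of subword tokens per whitespace-delimited word."""
    n_subwords, n_words = 0, 0
    for sentence in sentences:
        words = sentence.split()
        n_words += len(words)
        for word in words:
            n_subwords += len(tokenizer.tokenize(word))
    return n_subwords / n_words

# Example Finnish sentence ("This is an example sentence in Finnish.")
sentences = ["Tämä on esimerkkilause suomeksi."]

multi = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mono = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")  # assumed identifier

print("multilingual fertility:", fertility(multi, sentences))
print("monolingual fertility:", fertility(mono, sentences))
```

A lower fertility indicates that the tokenizer splits words into fewer pieces, i.e., the language is better represented in its vocabulary; this is the kind of gap the paper investigates between multilingual and dedicated monolingual tokenizers.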

