Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

05/26/2023
by Tomasz Limisiewicz, et al.

Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and the vocabulary overlap observed in sub-word tokenizers. Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks (POS, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing vocabulary. We also observe that the coverage of language-specific tokens in the multilingual vocabulary significantly impacts word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models, along with guidelines that help future model developers choose the most suitable tokenizer for their specific application before undertaking costly model pre-training.
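
The abstract does not spell out the proposed criteria, so the following is only a minimal sketch of how such measurements might look. It assumes that vocabulary overlap between two languages is quantified as the Jensen-Shannon divergence between their token frequency distributions, and that vocabulary allocation is approximated by the average number of characters per token. Both choices, the function names, and the toy token lists are illustrative assumptions, not definitions taken from the text above.

```python
import math
from collections import Counter


def token_distribution(tokens):
    """Relative frequency of each token in a tokenized corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD in bits between two token distributions (dict: token -> prob).

    0.0 means identical distributions (full overlap);
    1.0 means the two languages share no tokens at all.
    """
    vocab = set(p) | set(q)
    # Mixture distribution over the union of both vocabularies.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def characters_per_token(tokens):
    """Average token length; higher values suggest a richer allocation."""
    return sum(len(t) for t in tokens) / len(tokens)


# Hypothetical sub-word tokenizations of two small corpora (assumed data).
lang_a = ["inter", "nation", "al", "model", "s", "inter", "est"]
lang_b = ["inter", "nacional", "model", "o", "s", "inter", "és"]

overlap = jensen_shannon_divergence(
    token_distribution(lang_a), token_distribution(lang_b)
)
print(f"JSD between languages: {overlap:.3f}")
print(f"chars/token, language A: {characters_per_token(lang_a):.2f}")
print(f"chars/token, language B: {characters_per_token(lang_b):.2f}")
```

Measures of this kind can be computed from a tokenizer and raw text alone, which fits the abstract's point that a tokenizer can be assessed before any costly pre-training is undertaken.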

research
04/09/2020

Learning to Scale Multilingual Representations for Vision-Language Tasks

Current multilingual vision-language models either require a large numbe...
research
10/24/2020

Improving Multilingual Models with Language-Clustered Vocabularies

State-of-the-art multilingual models depend on vocabularies that cover a...
research
11/27/2019

SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling

With language modeling becoming the popular base task for unsupervised r...
research
01/25/2023

XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Large multilingual language models typically rely on a single vocabulary...
research
11/23/2022

Word-Level Representation From Bytes For Language Modeling

Modern language models mostly take sub-words as input, a design that bal...
research
06/09/2020

Examination and Extension of Strategies for Improving Personalized Language Modeling via Interpolation

In this paper, we detail novel strategies for interpolating personalized...
research
02/25/2020

Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction

Language-independent tokenisation (LIT) methods that do not require labe...
