XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

01/25/2023
by Davis Liang, et al.

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter than those of XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested, from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
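The abstract describes the vocabulary construction only at a high level. The sketch below shows one way such a scheme could be approximated with the SentencePiece library: train separate sub-vocabularies for clusters of lexically similar languages and merge them, so that unrelated languages do not compete for the same token budget. The cluster definitions, corpus paths, and the even capacity split are illustrative assumptions, not the paper's actual procedure.

```python
"""
Minimal sketch: build a large multilingual vocabulary by training one
SentencePiece model per language cluster and taking the union of their
pieces, so tokens are shared only between lexically similar languages.
Corpus paths and cluster assignments below are hypothetical.
"""
import sentencepiece as spm

# Hypothetical clusters of lexically similar languages; languages with
# little lexical overlap (e.g. Thai vs. Swahili) get separate budgets.
CLUSTERS = {
    "latin":    ["corpora/en.txt", "corpora/de.txt", "corpora/sw.txt"],
    "cyrillic": ["corpora/ru.txt", "corpora/uk.txt"],
    "thai":     ["corpora/th.txt"],
}

TOTAL_VOCAB = 1_000_000  # target size of the merged vocabulary


def build_vocabulary(clusters, total_vocab):
    # Naive capacity allocation: split the budget evenly across clusters.
    # (The paper assigns capacity so each language reaches sufficient
    # coverage; a real implementation would use a coverage-based metric.)
    per_cluster = total_vocab // len(clusters)

    pieces = set()
    for name, files in clusters.items():
        spm.SentencePieceTrainer.train(
            input=",".join(files),
            model_prefix=f"spm_{name}",
            vocab_size=per_cluster,
            character_coverage=0.9995,
            model_type="unigram",
        )
        sp = spm.SentencePieceProcessor(model_file=f"spm_{name}.model")
        # Union of per-cluster pieces: a token ends up shared only if the
        # same string was independently learned in more than one cluster.
        pieces.update(sp.id_to_piece(i) for i in range(sp.get_piece_size()))
    return pieces


if __name__ == "__main__":
    vocab = build_vocabulary(CLUSTERS, TOTAL_VOCAB)
    print(f"merged vocabulary size: {len(vocab)}")
```

One consequence of this design, consistent with the abstract's claim about shorter tokenizations, is that each language's sub-vocabulary can devote its full budget to that language's frequent word forms instead of splitting them into shared multilingual subwords.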


