Wine is Not v i n. – On the Compatibility of Tokenizations Across Languages

09/13/2021
by Antonis Maronikolakis, et al.

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., "wine" (word-level) in English vs. "v i n" (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.
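To make the "wine" vs. "v i n" mismatch concrete, here is a minimal sketch in Python. It is not the compatibility measure proposed in the paper, which the abstract does not define. As a stand-in, it compares the average number of tokens per word in two languages under a toy greedy longest-match tokenizer. The function names (`greedy_tokenize`, `tokens_per_word`, `granularity_gap`) and the tiny vocabularies are assumptions made only for illustration.

```python
from typing import Iterable, List, Set


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split a word by greedy longest-match against a vocabulary.

    Characters that are not covered by the vocabulary fall back to
    single-character tokens, so the function always terminates.
    """
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def tokens_per_word(words: Iterable[str], vocab: Set[str]) -> float:
    """Average number of tokens each word is split into."""
    words = list(words)
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


def granularity_gap(words_a, vocab_a, words_b, vocab_b) -> float:
    """Illustrative proxy only: absolute difference in tokens per word.

    A gap near 0 means both languages are split at a similar granularity;
    a large gap signals a word-level vs. character-level mismatch.
    """
    return abs(tokens_per_word(words_a, vocab_a) - tokens_per_word(words_b, vocab_b))


# Hypothetical vocabularies: English keeps "wine" whole, French only has characters.
en_vocab = {"wine"}
fr_vocab = {"v", "i", "n"}

en_tokens = greedy_tokenize("wine", en_vocab)  # word-level split
fr_tokens = greedy_tokenize("vin", fr_vocab)   # character-level split
gap = granularity_gap(["wine"], en_vocab, ["vin"], fr_vocab)
print(en_tokens, fr_tokens, gap)
```

In this toy setup the English word stays a single token while its French translation is split into three characters, so the two embeddings are learned at different granularities. A vocabulary designer could monitor a quantity of this kind while choosing per-language vocabulary sizes, although the paper's own measure may be defined differently.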


