Extending the Subwording Model of Multilingual Pretrained Models for New Languages

11/29/2022
by Kenji Imamura, et al.

Multilingual pretrained models are effective for machine translation and cross-lingual processing because they cover multiple languages in a single model. However, they are pretrained after their tokenizers are fixed; therefore, it is difficult to change the vocabulary after pretraining. When we extend pretrained models to new languages, we must modify the tokenizers at the same time. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of the already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation.
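As a rough illustration of the kind of tokenizer extension described above, the sketch below trains a SentencePiece model on monolingual Inuktitut text and appends its pieces to the SentencePiece vocabulary shipped with mBART-50, skipping pieces that already exist so the segmentation of the original languages is untouched. The file names, the vocabulary size, and the score assigned to new pieces are illustrative assumptions, not the paper's exact settings.

    import sentencepiece as spm
    # The protobuf schema ships with the sentencepiece package (requires protobuf installed).
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    # Train a SentencePiece model on Inuktitut data (file name and vocab size are assumptions).
    spm.SentencePieceTrainer.train(
        input="inuktitut.txt", model_prefix="iu_sp",
        vocab_size=8000, character_coverage=1.0,
    )

    # Load the original mBART-50 SentencePiece model (local path is an assumption)
    # and the newly trained Inuktitut model.
    base = sp_pb2.ModelProto()
    base.ParseFromString(open("mbart50/sentencepiece.bpe.model", "rb").read())
    iu = sp_pb2.ModelProto()
    iu.ParseFromString(open("iu_sp.model", "rb").read())

    # Append only the pieces that the base vocabulary does not already contain,
    # so existing pieces keep their IDs and the original segmentation is preserved.
    existing = {p.piece for p in base.pieces}
    for p in iu.pieces:
        if p.piece not in existing:
            new_piece = sp_pb2.ModelProto.SentencePiece()
            new_piece.piece = p.piece
            new_piece.score = 0.0  # score for new pieces is an assumption, not the paper's scheme
            base.pieces.append(new_piece)

    # Write out the extended tokenizer model.
    with open("extended.spm.model", "wb") as f:
        f.write(base.SerializeToString())

Because new pieces are only appended after the original vocabulary, token IDs of the already pretrained languages do not change. The pretrained model's embedding matrix still has to be enlarged to cover the added entries (for example, with randomly initialized rows) before fine-tuning on English-Inuktitut data; that step is separate from the tokenizer extension and is not shown here.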


