Romanization-based Large-scale Adaptation of Multilingual Language Models

04/18/2023
by Sukannya Purkayastha, et al.

Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. However, their large-scale deployment to many languages is hindered not only by pretraining data scarcity, but also by growing vocabulary sizes and limits on their parameter budget. To boost the capacity of mPLMs to deal with low-resource and unseen languages, we explore the potential of leveraging transliteration on a massive scale. In particular, we explore the UROMAN transliteration tool, which provides mappings from UTF-8 to Latin characters for all writing systems, enabling inexpensive romanization of virtually any language. We first establish how UROMAN compares against language-specific and manually curated transliterators for adapting multilingual PLMs. We then study and compare a range of data- and parameter-efficient strategies for adapting mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains in the most challenging setups: languages with unseen scripts and languages with limited training data, without any vocabulary augmentation. Further analyses reveal that an improved tokenizer based on romanized data can even outperform non-transliteration-based methods in the majority of languages.
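As a minimal sketch of the romanization step described above, the snippet below pipes UTF-8 text through the uroman.pl script (assuming a local checkout of the isi-nlp/uroman repository at the path shown) and compares how an off-the-shelf multilingual tokenizer segments the original and the romanized string. The path, the choice of xlm-roberta-base as the example mPLM, and the example sentence are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch, assuming a local checkout of the isi-nlp/uroman repository
# (path below) and the HuggingFace `transformers` library; the model name and
# example sentence are illustrative only.
import subprocess
from transformers import AutoTokenizer

UROMAN_SCRIPT = "uroman/bin/uroman.pl"  # assumed path to the UROMAN Perl script


def romanize(lines):
    """Pipe UTF-8 text through uroman.pl and return Latin-script output, line by line."""
    result = subprocess.run(
        ["perl", UROMAN_SCRIPT],
        input="\n".join(lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # An off-the-shelf multilingual tokenizer (XLM-R base here as an example mPLM).
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

    original = "ሰላም ለዓለም"               # example sentence in Ge'ez (Ethiopic) script
    romanized = romanize([original])[0]    # e.g. something like "selam le'alem"

    # Compare segmentation before and after romanization: for scripts the
    # tokenizer covers poorly, the original tends to fragment into many pieces,
    # while the romanized form reuses the tokenizer's Latin-script subwords.
    print(tokenizer.tokenize(original))
    print(tokenizer.tokenize(romanized))
```

The same romanize helper could be applied to an entire training corpus before continued pretraining or tokenizer training, which is the large-scale setting the paper studies; the comparison printed here only illustrates why romanization can help with unseen scripts and without vocabulary augmentation.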
