Does Transliteration Help Multilingual Language Modeling?

01/29/2022
by Ibraheem Muhammad Moosa et al.

As large, representative corpora are scarce for most languages, it is important for multilingual language models (MLLMs) to extract the most from the corpora that do exist. In this regard, script diversity presents a challenge to MLLMs by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different writing scripts into a common script may improve the downstream task performance of MLLMs. In this paper, we pretrain two ALBERT models to empirically measure the effect of transliteration on MLLMs. We specifically focus on the Indo-Aryan language family, which has the highest script diversity in the world. We then evaluate our models on the IndicGLUE benchmark and apply the Mann-Whitney U test to rigorously verify whether the effect of transliteration is statistically significant. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity (CLRS) of the models using centered kernel alignment (CKA) on parallel sentences of eight languages from the FLORES-101 dataset. We find that the hidden representations of the transliteration-based model have higher and more stable CLRS scores. Our code is available on GitHub (github.com/ibraheem-moosa/XLM-Indic) and the Hugging Face Hub (huggingface.co/ibraheemmoosa/xlmindic-base-multiscript and huggingface.co/ibraheemmoosa/xlmindic-base-uniscript).
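The core idea, reducing script diversity so that closely related languages share surface forms, can be sketched in a few lines. The snippet below is a rough, hypothetical illustration that exploits the parallel ISCII-derived layout of the Brahmic Unicode blocks to map Bengali text onto Devanagari; the paper's actual pipeline transliterates all corpora into one common script with a proper transliteration scheme, which this toy codepoint mapping only approximates.

```python
# Toy sketch: map Bengali codepoints onto the parallel Devanagari block.
# Devanagari occupies U+0900-U+097F and Bengali U+0980-U+09FF, and the two
# blocks share an ISCII-derived layout, so a plain offset maps most letters
# onto their Devanagari counterparts. Real transliteration handles the
# script-specific exceptions that this illustration ignores.

BENGALI_START, BENGALI_END = 0x0980, 0x09FF
DEVANAGARI_START = 0x0900

def bengali_to_devanagari(text: str) -> str:
    """Shift Bengali codepoints into the Devanagari block; pass others through."""
    out = []
    for ch in text:
        cp = ord(ch)
        if BENGALI_START <= cp <= BENGALI_END:
            out.append(chr(cp - BENGALI_START + DEVANAGARI_START))
        else:
            out.append(ch)  # digits, punctuation, Latin, etc. are untouched
    return "".join(out)

# Bengali and Hindi share the Sanskrit-derived word for "language"; once both
# are written in one script, a single tokenizer can see the shared subwords.
print(bengali_to_devanagari("ভাষা"))  # -> भाषा
```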
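For the significance analysis, a minimal sketch of how such a comparison is typically run with SciPy is shown below; the score arrays are invented placeholders, not numbers from the paper.

```python
# Hypothetical illustration of a Mann-Whitney U test comparing downstream
# scores of a transliteration-based model against a multi-script baseline.
from scipy.stats import mannwhitneyu

uniscript_scores = [78.2, 80.1, 76.5, 81.0, 79.4]    # placeholder values
multiscript_scores = [75.9, 79.3, 74.8, 80.2, 78.1]  # placeholder values

stat, p_value = mannwhitneyu(uniscript_scores, multiscript_scores,
                             alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")  # reject H0 at p < 0.05
```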
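The representation analysis relies on centered kernel alignment. Below is a minimal NumPy sketch of linear CKA (in the sense of Kornblith et al., 2019) between two matrices of sentence representations; the choice of layer, pooling, and the FLORES-101 parallel sentences themselves are outside this toy example.

```python
# Linear CKA between two representation matrices with n matched rows
# (e.g., hidden states of the same parallel sentences in two languages).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between X (n x d1) and Y (n x d2)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return float(numerator / denominator)

# Toy usage with random stand-ins for two languages' hidden states.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 768))
Y = (X @ rng.normal(size=(768, 768))) * 0.1 + rng.normal(size=(128, 768))
print(round(linear_cka(X, Y), 3))
```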

Related research

- Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages (05/09/2023)
- Automatic Readability Assessment for Closely Related Languages (05/22/2023)
- Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages (03/03/2022)
- Bilingual Language Modeling, A transfer learning technique for Roman Urdu (02/22/2021)
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages (05/20/2023)
- Adapters for Enhanced Modeling of Multilingual Knowledge and Text (10/24/2022)
- Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study (06/07/2021)
