Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

06/07/2021
by Yash Khemchandani, et al.

Recent research on multilingual language models (LMs) has demonstrated their ability to handle multiple languages effectively in a single model. This holds promise for low web-resource languages (LRLs), since multilingual models can enable transfer of supervision from high-resource languages to LRLs. However, incorporating a new language into an LM remains a challenge, particularly for languages with limited corpora and unseen scripts. In this paper, we argue that relatedness among languages in a language family can be exploited to overcome some of the corpora limitations of LRLs, and propose RelateLM. We focus on Indian languages and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script), and (2) sentence structure. RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL), Hindi in our case. To exploit similar sentence structures, RelateLM uses readily available bilingual dictionaries to pseudo-translate RPL text into LRL corpora. Experiments on multiple real-world benchmark datasets validate our hypothesis that using a related language as a pivot, together with transliteration- and pseudo-translation-based data augmentation, is an effective way to adapt LMs to LRLs, compared to direct training or pivoting through English.
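
The abstract describes its two data-augmentation steps only at a high level. As a rough illustration of the first step, script conversion, here is a minimal Python sketch that maps Brahmic-script text into Devanagari by exploiting the parallel layout of the Indic Unicode blocks. This is an assumption-laden approximation, not the paper's actual pipeline: RelateLM relies on a transliteration tool, the character-level alignment across blocks is imperfect for some scripts, and the names BLOCK_BASES and to_devanagari are invented for this example.

```python
# Minimal sketch: rule-based conversion of Brahmic-script text into
# Devanagari using the parallel layout of the Indic Unicode blocks.
# Illustrative only; real transliteration tools handle the cases where
# the block alignment breaks down.

# Base codepoints of a few Brahmic Unicode blocks (each 128 codepoints wide).
BLOCK_BASES = {
    "devanagari": 0x0900,
    "bengali":    0x0980,
    "gurmukhi":   0x0A00,
    "gujarati":   0x0A80,
    "oriya":      0x0B00,
    "telugu":     0x0C00,
    "kannada":    0x0C80,
}

def to_devanagari(text: str, source_script: str) -> str:
    """Map each character of `text` from `source_script` to the character
    at the same within-block offset in Devanagari; characters outside the
    source block (punctuation, digits, Latin) pass through unchanged."""
    base = BLOCK_BASES[source_script]
    out = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:               # inside the source block
            out.append(chr(0x0900 + (cp - base)))  # same offset in Devanagari
        else:
            out.append(ch)
    return "".join(out)

# Example: Gujarati "ભારત" (bhaarata, "India") -> Devanagari "भारत"
print(to_devanagari("ભારત", "gujarati"))
```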
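The second step, pseudo translation, can be sketched in the same spirit. Because related Indic languages largely share SOV sentence structure, replacing each RPL (Hindi) token with a bilingual-dictionary entry while keeping the word order yields rough LRL text for pretraining. The tiny Hindi-to-Gujarati lexicon and the helper pseudo_translate below are hypothetical, not taken from the paper.

```python
# Minimal sketch of dictionary-based pseudo translation: word-by-word
# substitution that preserves the shared SOV word order of related
# Indic languages. The toy lexicon is illustrative, not from the paper.

HI_TO_GU = {            # hypothetical Hindi -> Gujarati dictionary
    "मैं": "હું",        # I
    "पानी": "પાણી",      # water
    "पीता": "પીઉં",      # drink
    "हूँ": "છું",         # am
}

def pseudo_translate(sentence: str, lexicon: dict) -> str:
    """Replace each token via the lexicon; out-of-dictionary tokens are
    kept verbatim (their script can be transliterated afterwards)."""
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

# "मैं पानी पीता हूँ" ("I drink water") -> order-preserving Gujarati text
print(pseudo_translate("मैं पानी पीता हूँ", HI_TO_GU))
```

In practice the two sketches compose: dictionary lookup handles the words the lexicon covers, and transliteration handles the remainder, so the augmented corpus stays entirely in one script.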

Related research

10/24/2020 - When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models
Transfer learning based on pretraining language models on a large amount...

10/13/2022 - Bootstrapping Multilingual Semantic Parsers using Large Language Models
Despite cross-lingual generalization demonstrated by pre-trained multili...

04/18/2023 - Romanization-based Large-scale Adaptation of Multilingual Language Models
Large multilingual pretrained language models (mPLMs) have become the de...

02/23/2023 - In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
Multilingual generative language models (LMs) are increasingly fluent in...

05/01/2020 - Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi
Building natural language processing systems for non standardized and lo...

01/29/2022 - Does Transliteration Help Multilingual Language Modeling?
As there is a scarcity of large representative corpora for most language...

06/11/2019 - What Kind of Language Is Hard to Language-Model?
How language-agnostic are current state-of-the-art NLP tools? Are there ...
