Bilingual Language Modeling, A transfer learning technique for Roman Urdu

02/22/2021
by   Usama Khalid, et al.

Pretrained language models are now in widespread use in Natural Language Processing. Despite their success, applying them to low-resource languages remains a huge challenge. Although multilingual models hold great promise, applying them to a specific low-resource language, e.g. Roman Urdu, can be excessive. In this paper, we show how the code-switching property of languages may be used to perform cross-lingual transfer learning from a corresponding high-resource language. We also show how this transfer learning technique, termed Bilingual Language Modeling, can be used to produce better-performing models for Roman Urdu. To enable training and experimentation, we also present a collection of novel corpora for Roman Urdu extracted from various sources and social networking sites, e.g. Twitter. We train monolingual, multilingual, and bilingual models of Roman Urdu; the proposed bilingual model achieves 23% accuracy in the Masked Language Modeling (MLM) task, compared to the 2% of the baseline models.
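The Masked Language Modeling objective mentioned above corrupts a fraction of the input tokens and asks the model to recover them. A minimal sketch of that corruption step, in the standard BERT style (mask roughly 15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% unchanged) — the tiny Roman Urdu vocabulary and the `mask_tokens` helper here are illustrative assumptions, not the authors' code:

```python
import random

MASK = "[MASK]"
# Hypothetical mini-vocabulary for random replacements (illustration only).
VOCAB = ["aap", "kaise", "ho", "main", "theek", "hoon"]

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption.

    Selects ~mask_prob of positions as prediction targets; of those,
    80% are replaced with [MASK], 10% with a random vocabulary token,
    and 10% are left unchanged. Returns (corrupted_tokens, labels),
    where labels[i] is the original token at masked positions and
    None elsewhere.
    """
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK  # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, labels
```

During bilingual training, batches drawn from both the high-resource language and Roman Urdu would pass through the same corruption step, so the model learns to fill masks across the shared code-switched vocabulary.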


Related research

05/29/2019 — Regularization Advantages of Multilingual Neural Language Models for Low Resource Domains
Neural language modeling (LM) has led to significant improvements in sev...

05/02/2021 — Larger-Scale Transformers for Multilingual Masked Language Modeling
Recent work has demonstrated the effectiveness of cross-lingual language...

10/08/2018 — Zero-Resource Multilingual Model Transfer: Learning What to Share
Modern natural language processing and understanding applications have e...

12/07/2022 — JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset
JamPatoisNLI provides the first dataset for natural language inference i...

01/29/2022 — Does Transliteration Help Multilingual Language Modeling?
As there is a scarcity of large representative corpora for most language...

10/25/2021 — Paradigm Shift in Language Modeling: Revisiting CNN for Modeling Sanskrit Originated Bengali and Hindi Language
Though there has been a large body of recent works in language modeling ...

11/04/2019 — Emerging Cross-lingual Structure in Pretrained Language Models
We study the problem of multilingual masked language modeling, i.e. the ...
