Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic

by Soumajyoti Sarkar, et al.

The use of multilingual language models for tasks in low- and high-resource languages has been a success story in deep learning. In recent times, Arabic has received widespread attention on account of its dialectal variance. While prior studies have tried to adapt these multilingual models to dialectal variants of Arabic, this remains a challenging problem owing to the lack of sufficient monolingual dialectal data and of parallel translation data for such variants. It remains an open question whether limited dialectal data can be used to improve models trained on Arabic for its dialectal variants. First, we show that multilingual BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time and yields accuracy comparable to our custom monolingual Arabic model, while beating existing models (by an average metric gain of +6.41). We then explore two continual pre-training methods: (1) continual finetuning on small amounts of dialectal data, and (2) training on parallel Arabic-to-English data with a Translation Language Modeling (TLM) loss. We show that both approaches improve performance on dialectal classification tasks (+4.64 average gain) when applied to monolingual models.
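The second continual pre-training method in the abstract relies on the Translation Language Modeling objective, which concatenates a parallel sentence pair and masks tokens in both languages so the model can attend to the translation when predicting a masked token. The sketch below illustrates only the data-preparation side of that objective; the function name, the `[SEP]`/`[MASK]` markers, and the `-100` ignore label are illustrative assumptions (the `-100` convention follows common masked-LM implementations), not the paper's actual code.

```python
import random

MASK, IGNORE = "[MASK]", -100  # -100 is the usual "skip this position" label

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, rng=None):
    """Build one Translation Language Modeling (TLM) training example.

    Concatenates a parallel sentence pair with a separator and masks
    tokens on *both* sides, so a masked Arabic token can be predicted
    from its English translation context (and vice versa).
    """
    rng = rng or random.Random(0)
    tokens = src_tokens + ["[SEP]"] + tgt_tokens
    inputs, labels = [], []
    for tok in tokens:
        if tok != "[SEP]" and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # model must recover the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # position excluded from the loss
    return inputs, labels

# Hypothetical Arabic-English pair (whitespace tokenization for brevity):
inp, lab = make_tlm_example("كيف حالك".split(), "how are you".split(),
                            mask_prob=0.3, rng=random.Random(7))
```

A masked-LM loss over `labels` (ignoring the `-100` positions) then gives the TLM objective; the continual-finetuning variant in method (1) would instead run the ordinary monolingual masked-LM objective over the small dialectal corpus.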


