Zero-Resource Multi-Dialectal Arabic Natural Language Understanding

by Muhammad Khalifa, et al.

Fine-tuning pre-trained language models (PLMs) on downstream tasks usually requires a reasonable amount of annotated data, but obtaining labeled examples for different language varieties can be costly. In this paper, we investigate zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on Modern Standard Arabic (MSA) data only, identifying a significant performance drop when evaluating such models on DA. To remedy this drop, we propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data: it improves zero-shot MSA-to-DA transfer by up to ~10% F_1 (NER), 2% accuracy (POS tagging), and 4.5% F_1 (SRD). An ablation experiment shows that the observed performance boost results directly from the unlabeled DA examples used for self-training. Our work opens up opportunities for leveraging the relatively abundant labeled MSA datasets to develop DA models for zero- and low-resource dialects. We also report new state-of-the-art performance on all three tasks and open-source our fine-tuned models for the research community.
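The self-training recipe the abstract describes can be sketched as a pseudo-labeling loop: a model trained on labeled MSA data repeatedly labels unlabeled DA examples, keeps only high-confidence predictions, and retrains on the union. The sketch below illustrates that loop only; the nearest-centroid "model", the confidence heuristic, and the threshold are toy stand-ins for illustration, not the paper's pre-trained language model or its actual selection criterion.

```python
# Toy sketch of a self-training (pseudo-labeling) loop, assuming a simple
# nearest-centroid classifier stands in for the fine-tuned PLM.
from collections import defaultdict
import math


def train(examples):
    """Fit a nearest-centroid classifier: one mean 2-D vector per label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in examples:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {lab: (s[0] / counts[lab], s[1] / counts[lab])
            for lab, s in sums.items()}


def predict(model, point):
    """Return (label, confidence); confidence decays with centroid distance."""
    dists = {lab: math.dist(point, centroid) for lab, centroid in model.items()}
    label = min(dists, key=dists.get)
    return label, 1.0 / (1.0 + dists[label])


def self_train(labeled, unlabeled, threshold=0.5, rounds=3):
    """Iteratively add confident pseudo-labeled examples and retrain."""
    model = train(labeled)
    for _ in range(rounds):
        confident = []
        for point in unlabeled:
            label, conf = predict(model, point)
            if conf >= threshold:  # keep only high-confidence pseudo-labels
                confident.append((point, label))
        if not confident:
            break
        model = train(list(labeled) + confident)
    return model


# Labeled "source" data and unlabeled "target" data (analogous to MSA vs. DA).
labeled = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(0.5, 0.2), (9.5, 10.5)]
model = self_train(labeled, unlabeled)
```

In the paper's setting, `train` corresponds to fine-tuning the PLM and `predict` to running it over unlabeled dialectal text; the key design choice is the same: only predictions passing a confidence filter are added back as training signal.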

