MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script

06/18/2022
by   Randa Zarnoufi, et al.
0

Social media user-generated text is actually the main resource for many NLP tasks. This text however, does not follow the standard rules of writing. Moreover, the use of dialect such as Moroccan Arabic in written communications increases further NLP tasks complexity. A dialect is a verbal language that does not have a standard orthography, which leads users to improvise spelling while writing. Thus, for the same word we can find multiple forms of transliterations. Subsequently, it is mandatory to normalize these different transliterations to one canonical word form. To reach this goal, we have exploited the powerfulness of word embedding models generated with a corpus of YouTube comments. Besides, using a Moroccan Arabic dialect dictionary that provides the canonical forms, we have built a normalization dictionary that we refer to as MANorm. We have conducted several experiments to demonstrate the efficiency of MANorm, which have shown its usefulness in dialect normalization.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/12/2019

Adapting Sequence to Sequence models for Text Normalization in Social Media

Social media offer an abundant source of valuable raw data, however info...
research
05/10/2019

Restoring Arabic vowels through omission-tolerant dictionary lookup

Vowels in Arabic are optional orthographic symbols written as diacritics...
research
09/11/2018

Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

Arabic is a widely-spoken language with a long and rich history, but exi...
research
03/20/2020

TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus

This article describes the constitution process of the first morpho-synt...
research
01/26/2023

Beyond Arabic: Software for Perso-Arabic Script Manipulation

This paper presents an open-source software library that provides a set ...
research
05/25/2023

Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

The wide accessibility of social media has provided linguistically under...

Please sign up or login with your details

Forgot password? Click here to reset