Curras + Baladi: Towards a Levantine Corpus

05/19/2022
by Karim El Haff, et al.

The processing of the Arabic language is a complex field of research. This is due to many factors, including the rich and complex morphology of Arabic, its high degree of ambiguity, and the presence of several regional varieties, each of which must be processed with its unique characteristics in mind. When its dialects are taken into account, Arabic pushes the limits of NLP to find solutions to problems posed by its inherent nature. It is a diglossic language: the standard language is used in formal settings and in education, and it differs considerably from the vernaculars spoken in the different regions, which were influenced by older languages historically spoken there. This should encourage NLP specialists to create dialect-specific corpora such as Curras, the morphologically annotated Palestinian corpus of Birzeit University. In this work, we present Baladi, a Lebanese corpus that consists of around 9.6K morphologically annotated tokens. Since the Lebanese and Palestinian dialects are part of the same Levantine dialectal continuum, and are thus highly mutually intelligible, our proposed corpus was constructed to (1) enrich Curras and transform it into a more general Levantine corpus and (2) improve Curras by correcting detected errors.


