TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus

03/20/2020
by Elisa Gugliotta, et al.

This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous encoding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking social media users to make writing easier in informal settings such as Computer-Mediated Communication (CMC) and text messaging. The realization of Arabish varies across dialects, and each Arabish code-system is under-resourced, as are most Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has increased considerably. TArC is therefore intended to support both computational and linguistic analyses, as well as the training of NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses we carried out on it. In addition, to give a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and their encoding in Tunisian Arabish.
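To make the notion of arithmographs concrete, the following is a minimal Python sketch that flags digits used as letters inside a token. The digit-to-letter table reflects commonly cited Arabizi conventions and is an assumption, not the encoding documented for TArC; the function name `find_arithmographs` and the sample tokens are likewise hypothetical.

```python
import re

# Illustrative mapping only: commonly cited Arabizi conventions,
# NOT the encoding scheme documented in the TArC article (assumption).
ARITHMOGRAPHS = {
    "2": "ء",
    "3": "ع",
    "5": "خ",
    "7": "ح",
    "9": "ق",
}

# A token is treated as Arabish-like only if it contains at least one Latin letter.
_HAS_LETTER = re.compile(r"[A-Za-z]")


def find_arithmographs(token):
    """Return (index, digit, arabic_letter) for each mapped digit in a mixed letter-digit token."""
    if not _HAS_LETTER.search(token):
        # Pure numbers such as "2020" are treated as ordinary numerals.
        return []
    return [
        (i, ch, ARITHMOGRAPHS[ch])
        for i, ch in enumerate(token)
        if ch in ARITHMOGRAPHS
    ]


if __name__ == "__main__":
    # Hypothetical sample tokens, chosen only to exercise the function.
    for tok in ["3asslema", "ya7ki", "2020", "salut"]:
        print(tok, find_arithmographs(tok))
```

A real pipeline would need dialect-specific rules, since the same digit can be an ordinary numeral in one context and a letter in another; the letter-presence check above is only a crude heuristic for that distinction.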

research
09/11/2018

Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

Arabic is a widely-spoken language with a long and rich history, but exi...
research
07/11/2022

TArC: Tunisian Arabish Corpus First complete release

In this paper we present the final result of a project on Tunisian Arabi...
research
11/22/2022

ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English

We present our work on collecting ArzEn-ST, a code-switched Egyptian Ara...
research
11/15/2015

A System for Extracting Sentiment from Large-Scale Arabic Social Data

Social media data in Arabic language is becoming more and more abundant....
research
05/05/2020

Digraphie des langues ouest africaines : Latin2Ajami : un algorithme de translitteration automatique

The national languages of Senegal, like those of West Africa country in ...
research
06/18/2022

MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script

Social media user-generated text is actually the main resource for many ...
research
05/15/2020

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

This paper describes the process of building an annotated corpus and tra...
