Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations

12/13/2022
by   Mustafa Jarrar, et al.

This article presents Lisan, a set of morphologically annotated corpora for the Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects. Lisan contains around 1.2 million tokens. We collected the content of the corpora from several social media platforms. The Yemeni corpus (about 1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (about 50K tokens each) were collected manually from Facebook and YouTube posts and comments. Thirty-five (35) annotators, all native speakers of the target dialects, carried out the annotation. They segmented all words in the four corpora into prefixes, stems, and suffixes and labeled each with morphological features such as part of speech, lemma, and an English gloss. We developed an Arabic Dialect Annotation Toolkit (ADAT) to assist the annotators and to ensure compatibility with the SAMA and Curras tagsets. The annotators were trained on a set of guidelines and on how to use ADAT. The tool is open source, and the four corpora are also available online.
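To make the annotation scheme described above concrete, the following is a minimal sketch of how one annotated token might be represented. It is not the Lisan release format or ADAT's data model: the class names, field names, and the `is_consistent` check are all hypothetical. The example word is an invented illustration rather than an entry from the corpora, and its tag strings are only SAMA-style.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One morphological segment (prefix, stem, or suffix) with its POS tag."""
    surface: str
    pos: str


@dataclass
class AnnotatedToken:
    """Hypothetical record for one word: segmentation plus lemma and gloss."""
    word: str
    stem: Segment
    lemma: str
    gloss: str
    prefixes: List[Segment] = field(default_factory=list)
    suffixes: List[Segment] = field(default_factory=list)

    def segments(self) -> List[Segment]:
        # Segments in surface order: prefixes, then stem, then suffixes.
        return [*self.prefixes, self.stem, *self.suffixes]

    def is_consistent(self) -> bool:
        # Simplifying assumption: the segments concatenate back to the word.
        # Real annotations may normalize spelling, so this can be too strict.
        return "".join(s.surface for s in self.segments()) == self.word


# Invented illustration (not taken from the corpora): "and their book".
token = AnnotatedToken(
    word="وكتابهم",
    prefixes=[Segment("و", "CONJ")],
    stem=Segment("كتاب", "NOUN"),
    suffixes=[Segment("هم", "POSS_PRON_3MP")],
    lemma="كتاب",
    gloss="book",
)

print([(s.surface, s.pos) for s in token.segments()])
print(token.is_consistent())
```

Keeping prefixes and suffixes as lists allows a word to carry several clitics, and the consistency check is the kind of validation an annotation tool could run before accepting an entry.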
