A Large Scale Corpus of Gulf Arabic

09/09/2016
by Salam Khalifa, et al.

Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received attention lately and have a growing collection of resources, including annotated corpora, morphological analyzers, and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study on the morphological annotation of Gulf Arabic, which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.

