Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data

09/28/2019
by   Mona Diab, et al.
0

We present our effort to create a large Multi-Layered representational repository of Linguistic Code-Switched Arabic data. The process involves developing clear annotation standards and Guidelines, streamlining the annotation process, and implementing quality control measures. We used two main protocols for annotation: in-lab gold annotations and crowd sourcing annotations. We developed a web-based annotation tool to facilitate the management of the annotation process. The current version of the repository contains a total of 886,252 tokens that are tagged into one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic representing three data genres: Tweets, commentaries, and discussion fora. The overall Inter-Annotator Agreement is 93.1

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/23/2018

Guidelines and Annotation Framework for Arabic Author Profiling

In this paper, we present the annotation pipeline and the guidelines we ...
research
03/24/2017

Crowdsourcing Universal Part-Of-Speech Tags for Code-Switching

Code-switching is the phenomenon by which bilingual speakers switch betw...
research
11/22/2020

Standardizing linguistic data: method and tools for annotating (pre-orthographic) French

With the development of big corpora of various periods, it becomes cruci...
research
09/28/2019

WASA: A Web Application for Sequence Annotation

Data annotation is an important and necessary task for all NLP applicati...
research
08/23/2018

Arap-Tweet: A Large Multi-Dialect Twitter Corpus for Gender, Age and Language Variety Identification

In this paper, we present Arap-Tweet, which is a large-scale and multi-d...
research
03/17/2022

Towards Responsible Natural Language Annotation for the Varieties of Arabic

When building NLP models, there is a tendency to aim for broader coverag...
research
10/13/2022

ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution

Large-scale, high-quality corpora are critical for advancing research in...

Please sign up or login with your details

Forgot password? Click here to reset