Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data

by Mona Diab, et al.

We present our effort to create a large multi-layered representational repository of linguistic code-switched Arabic data. The process involves developing clear annotation standards and guidelines, streamlining the annotation process, and implementing quality control measures. We used two main annotation protocols: in-lab gold annotation and crowdsourced annotation. We developed a web-based annotation tool to manage the annotation process. The current version of the repository contains a total of 886,252 tokens, each tagged with one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic across three genres: tweets, commentaries, and discussion fora. The overall inter-annotator agreement is 93.1%.
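As a rough illustration of how a token-level agreement figure of this kind can be computed, the sketch below averages the fraction of matching tags over every pair of annotators. It is an assumption for illustration only: the abstract does not state which agreement measure was used, and the function name `pairwise_agreement` and the tags `msa` and `da` are hypothetical stand-ins for the repository's sixteen tags.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Mean fraction of tokens on which each pair of annotators agrees.

    annotations: list of equal-length tag sequences, one per annotator.
    This is plain observed agreement (no chance correction), chosen here
    as an assumption; the source does not specify its measure.
    """
    if len(annotations) < 2:
        raise ValueError("need at least two annotators")
    n_tokens = len(annotations[0])
    if n_tokens == 0 or any(len(a) != n_tokens for a in annotations):
        raise ValueError("tag sequences must be non-empty and of equal length")

    scores = []
    for a, b in combinations(annotations, 2):
        # Count tokens where this pair of annotators chose the same tag.
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / n_tokens)
    return sum(scores) / len(scores)


# Hypothetical tags for four tokens from three annotators.
ann1 = ["msa", "da", "da", "msa"]
ann2 = ["msa", "da", "msa", "msa"]
ann3 = ["msa", "da", "da", "msa"]
agreement = pairwise_agreement([ann1, ann2, ann3])
print(f"pairwise agreement: {agreement:.1%}")
```

Observed agreement is the simplest choice; a chance-corrected statistic such as Cohen's kappa would give a lower figure on the same data.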




