Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

10/13/2021
by   Zaid Alyafeai, et al.

The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most published datasets lack metadata annotations that describe their attributes. Moreover, there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. This issue becomes more prominent for low-resource dialectal languages. In this paper, we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also highlight some issues with the current status of Arabic NLP datasets and suggest recommendations to address them.

03/24/2022

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

NLP research is impeded by a lack of resources and awareness of the chal...
01/25/2022

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

In recent years, large-scale data collection efforts have prioritized th...
08/23/2018

Guidelines and Annotation Framework for Arabic Author Profiling

In this paper, we present the annotation pipeline and the guidelines we ...
07/11/2020

Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages

Natural Language Processing (NLP) is increasingly used as a key ingredie...
09/12/2019

NSURL-2019 Shared Task 8: Semantic Question Similarity in Arabic

Question semantic similarity (Q2Q) is a challenging task that is very us...
12/31/2020

AraGPT2: Pre-Trained Transformer for Arabic Language Generation

Recently, pretrained transformer-based architectures have proven to be v...
05/19/2022

Curras + Baladi: Towards a Levantine Corpus

The processing of the Arabic language is a complex field of research. Th...