Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

10/13/2021
by Zaid Alyafeai, et al.

The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets for evaluating the tasks we are trying to solve. Unfortunately, most published datasets lack metadata annotations that describe their attributes, and there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. The issue becomes more prominent for low-resource dialectal languages. In this paper we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also remark on the current status of Arabic NLP datasets, highlight some of its issues, and suggest recommendations to address them.
