An open access NLP dataset for Arabic dialects: Data collection, labeling, and model construction

by ElMehdi Boujou, et al.

Natural Language Processing (NLP) is today a very active field of research and innovation. Many applications, however, need large data sets for supervised learning, suitably labeled for training. This includes applications for the Arabic language and its national dialects. Yet open access labeled data sets for Arabic and its dialects are lacking in the data science ecosystem, and this lack can hinder innovation and research in the field. In this work, we present an open data set of social media content in several Arabic dialects. The data was collected from the Twitter social network and consists of more than 50K tweets in five (5) national dialects. Furthermore, the data was labeled for several applications, namely dialect detection, topic detection, and sentiment analysis. We publish it as open access data to encourage innovation and further work in the field of NLP for Arabic dialects and social media. A selection of models was built using this data set and is presented in this paper along with their performance.





