Shamela: A Large-Scale Historical Arabic Corpus

12/28/2016
by   Yonatan Belinkov, et al.

Arabic is a widely spoken language with a long and rich history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale historical corpus of Arabic, comprising about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies that show its application to the digital humanities.
