1.5 billion words Arabic Corpus

11/12/2016
by Ibrahim Abu El-Khair, et al.

This study is an attempt to build a contemporary linguistic corpus for the Arabic language. The resulting text corpus includes more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arab countries over a period of fourteen years. The corpus was encoded in two character encodings, UTF-8 and Windows CP-1256, and marked up in two markup languages, SGML and XML.
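As a rough illustration of the two encodings named in the abstract, the following Python sketch encodes a short Arabic string in both UTF-8 and Windows CP-1256 and checks that each round-trips. The sample string is a hypothetical example, not text from the corpus, and the snippet does not reflect how the authors actually produced their files.

```python
# Hypothetical sample phrase ("the Arabic language"); not taken from the corpus.
sample = "اللغة العربية"

# Python's standard codecs include both encodings mentioned in the abstract.
utf8_bytes = sample.encode("utf-8")
cp1256_bytes = sample.encode("cp1256")

# Arabic letters take two bytes each in UTF-8 but one byte each in CP-1256,
# so the CP-1256 form of plain Arabic text is roughly half the size.
print(len(sample), len(utf8_bytes), len(cp1256_bytes))

# Both encodings decode back to the original string without loss.
assert utf8_bytes.decode("utf-8") == sample
assert cp1256_bytes.decode("cp1256") == sample
```

CP-1256 covers only the Arabic script and a limited set of other characters, which may be one reason a corpus would be distributed in UTF-8 as well.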


research
09/09/2016

A Large Scale Corpus of Gulf Arabic

Most Arabic natural language processing tools and resources are develope...
research
06/14/2021

Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus

We introduced the contemporary Amharic corpus, which is automatically ta...
research
07/12/2023

A Study on the Appropriate size of the Mongolian general corpus

This study aims to determine the appropriate size of the Mongolian gener...
research
04/02/2020

Mapping Languages: The Corpus of Global Language Use

This paper describes a web-based corpus of global language use with a fo...
research
08/13/2021

MIND - Mainstream and Independent News Documents Corpus

This paper presents and characterizes MIND, a new Portuguese corpus comp...
research
07/15/2015

Associative Measures and Multi-word Unit Extraction in Turkish

Associative measures are "mathematical formulas determining the strength...
research
11/18/2022

Corpus non alignés et ADT. Essai de comparaison entre les présidents français et brésiliens de l'ère contemporaine

Is there an ADT method that can deal with non-aligned bilingual corpora?...
