naab: A ready-to-use plug-and-play corpus for Farsi

08/29/2022
by Sadra Sabouri, et al.

Large text corpora are a crucial need for training deep models such as transformer-based ones. This need is more pressing in lower-resource languages like Farsi. We propose naab, the biggest cleaned and ready-to-use open-source textual corpus in Farsi. It contains about 130GB of data, 250 million paragraphs, and 15 billion words. The project name is derived from the Farsi word naab, which means pure and high-grade. We also provide the raw version of the corpus, called naab-raw, and an easy-to-use preprocessor for those who want to build a customized corpus.
