esCorpius: A Massive Spanish Crawling Corpus

In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results for Spanish present important shortcomings: they are either too small in comparison with other languages or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license and is available on HuggingFace.
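To make the idea of deduplication that respects document and paragraph boundaries more concrete, the following is a minimal sketch in Python. It is not the esCorpius pipeline: the abstract does not specify the mechanisms used, so the function name `dedup_paragraphs`, the exact-match hashing strategy, and the case and whitespace normalisation are all assumptions made purely for illustration.

```python
import hashlib


def dedup_paragraphs(documents):
    """Drop paragraphs already seen anywhere in the corpus.

    Hypothetical helper (not from the paper): each document is a string
    whose paragraphs are separated by newlines. Surviving paragraphs stay
    inside their original document, so document boundaries are preserved.
    """
    seen = set()      # hashes of every paragraph kept so far
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            text = para.strip()
            if not text:
                continue  # skip empty lines
            # Assumed normalisation: lowercase and collapse whitespace
            normalised = " ".join(text.lower().split())
            key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
            if key in seen:
                continue  # exact duplicate after normalisation
            seen.add(key)
            kept.append(text)
        if kept:  # a document with no surviving paragraphs is dropped
            cleaned.append("\n".join(kept))
    return cleaned
```

A sketch like this only catches exact repeats, such as legal notices or cookie banners copied across pages; near-duplicate detection would need a different technique.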


