Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

01/17/2022
by Julien Abadji, et al.

The need for large raw corpora has increased dramatically in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. While there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still automatic web crawling. In this paper, we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations to produce a new document-oriented version of OSCAR that could prove more suitable for pre-training large generative language models, as well as for other applications in Natural Language Processing and Digital Humanities.

research
06/30/2022

esCorpius: A Massive Spanish Crawling Corpus

In recent years, transformer-based models have led to significant a...
research
04/28/2023

CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

In recent years, the field of document understanding has progressed a lo...
research
09/06/2023

Large Language Models for Automated Open-domain Scientific Hypotheses Discovery

Hypothetical induction is recognized as the main reasoning type when sci...
research
10/27/2020

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Large text corpora are increasingly important for a wide variety of Natu...
research
12/20/2022

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

As demand for large corpora increases with the size of current state-of-...
research
10/20/2020

Natural Language Inference with Mixed Effects

There is growing evidence that the prevalence of disagreement in the raw...
research
06/20/2018

TxPI-u: A Resource for Personality Identification of Undergraduates

Resources such as labeled corpora are necessary to train automatic model...
