CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

11/01/2019
by Guillaume Wenzek, et al.

Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as the quality of those corpora is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high-quality corpora like Wikipedia.
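
As a rough illustration of two of the steps named in the abstract (deduplication and filtering toward a reference corpus), the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how closeness to Wikipedia is measured, so the perplexity of an add-one smoothed unigram model stands in for it here as an assumption. All names (`normalize`, `deduplicate`, `UnigramModel`, `filter_close_to_reference`) and the threshold value are hypothetical, and language identification is omitted because it needs a trained classifier.

```python
import hashlib
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


class UnigramModel:
    """Add-one smoothed unigram model fitted on a reference corpus.

    Assumption: a stand-in for whatever measure of closeness the pipeline uses.
    """

    def __init__(self, reference_docs):
        self.counts = Counter(
            tok for doc in reference_docs for tok in normalize(doc).split()
        )
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def perplexity(self, doc):
        """Per-token perplexity of doc; lower means closer to the reference."""
        tokens = normalize(doc).split()
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab))
            for tok in tokens
        )
        return math.exp(-log_prob / len(tokens))


def filter_close_to_reference(docs, model, max_perplexity):
    """Keep documents whose perplexity is at most the (hypothetical) threshold."""
    return [d for d in docs if model.perplexity(d) <= max_perplexity]


if __name__ == "__main__":
    reference = ["the cat sat on the mat", "the dog sat on the rug"]
    crawl = ["The cat  sat", "the cat sat", "zzz qqq xxx"]
    model = UnigramModel(reference)
    unique = deduplicate(crawl)
    print(filter_close_to_reference(unique, model, max_perplexity=15.0))
```

In this sketch the threshold is a free parameter; in practice it would have to be tuned per language, since perplexity values are not comparable across models trained on different reference corpora.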

research
03/30/2023

The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

Pre-training Large Language Models (LLMs) require massive amounts of tex...
research
08/08/2023

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotatio...
research
09/01/2021

Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development

This paper introduces a human-in-the-loop (HITL) data annotation pipelin...
research
04/28/2023

CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

In recent years, the field of document understanding has progressed a lo...
research
01/25/2022

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Language models increasingly rely on massive web dumps for diverse text ...
research
03/15/2022

Does Corpus Quality Really Matter for Low-Resource Languages?

The vast majority of non-English corpora are derived from automatically ...
research
03/11/2020

From Algebraic Word Problem to Program: A Formalized Approach

In this paper, we propose a pipeline to convert grade school level algeb...
