Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

by Isaac Caswell, et al.

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.





