DeepAI AI Chat
Log In Sign Up

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

by   Isaac Caswell, et al.

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50 Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.


MMCR4NLP: Multilingual Multiway Corpora Repository for Natural Language Processing

Multilinguality is gradually becoming ubiquitous in the sense that more ...

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Large text corpora are increasingly important for a wide variety of Natu...

esCorpius: A Massive Spanish Crawling Corpus

In the recent years, transformer-based models have lead to significant a...

A New Approach for Semi-automatic Building and Extending a Multilingual Terminology Thesaurus

This paper describes a new system for semi-automatically building, exten...

Toward More Meaningful Resources for Lower-resourced Languages

In this position paper, we describe our perspective on how meaningful re...

A Large-Scale Multilingual Study of Visual Constraints on Linguistic Selection of Descriptions

We present a large, multilingual study into how vision constrains lingui...

DINFRA: A One Stop Shop for Computing Multilingual Semantic Relatedness

This demonstration presents an infrastructure for computing multilingual...

Code Repositories


Obtaining quality (Standard Southern) Quechua data for NLP applications

view repo