Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

03/22/2021
by Isaac Caswell, et al.

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or of whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.


10/03/2017

MMCR4NLP: Multilingual Multiway Corpora Repository for Natural Language Processing

Multilinguality is gradually becoming ubiquitous in the sense that more ...
10/27/2020

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Large text corpora are increasingly important for a wide variety of Natu...
03/26/2019

A New Approach for Semi-automatic Building and Extending a Multilingual Terminology Thesaurus

This paper describes a new system for semi-automatically building, exten...
01/17/2022

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

The need for large raw corpora has dramatically increased in recent ...
02/24/2022

Toward More Meaningful Resources for Lower-resourced Languages

In this position paper, we describe our perspective on how meaningful re...
11/01/2019

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

Pre-training text representations have led to significant improvements i...
07/15/2020

Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

This paper presents two colloquial Sinhala language corpora from the lan...

Code Repositories

quechua-nlp

Obtaining quality (Standard Southern) Quechua data for NLP applications

