Does Corpus Quality Really Matter for Low-Resource Languages?

03/15/2022
by Mikel Artetxe, et al.

The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues with the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is primarily constrained by the quantity rather than the quality of the data, prompting for methods to exploit more diverse data sources.

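To make the contrast with CommonCrawl filtering concrete, the following is a minimal, purely illustrative sketch of what "tailored crawling" over a hand-picked list of sites could look like. The domain names are hypothetical placeholders and the code is not the authors' actual EusCrawl pipeline; it only shows the general pattern of fetching curated pages and extracting their text with standard Python libraries (requests, BeautifulSoup).

# Illustrative sketch only: crawl a small, manually curated list of sites
# (hypothetical domains) instead of filtering CommonCrawl dumps.
import requests
from bs4 import BeautifulSoup

CURATED_SITES = [
    "https://example-basque-news.eus",      # hypothetical news site
    "https://example-basque-magazine.eus",  # hypothetical magazine
]

def extract_text(url: str) -> str:
    """Fetch a page and return its visible text, stripped of markup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style elements before extracting the visible text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    for url in CURATED_SITES:
        try:
            text = extract_text(url)
            print(f"{url}: {len(text.split())} words extracted")
        except requests.RequestException as err:
            print(f"Failed to fetch {url}: {err}")

The point of such a setup is that document quality is controlled at the source-selection stage (choosing trusted websites) rather than by post-hoc heuristic filtering of a noisy web crawl.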