Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources

by Xinyan Velocity Yu, et al.

While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys that estimate resource availability from the number of datasets can be misleading because dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created (the sources of their input text and labels, and the tools used to build them) and what they study (the tasks they address and the motivations for their creation). After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro- and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.


NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

Natural language processing (NLP) has a significant impact on society vi...

A Survey of Corpora for Germanic Low-Resource Languages and Dialects

Despite much progress in recent years, the vast majority of work in natu...

Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World

Linguistic disparity in the NLP world is a problem that has been widely ...

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

We present NusaCrowd, a collaborative initiative to collect and unite ex...

Digitization of the Australian Parliamentary Debates, 1998-2022

Public knowledge of what is said in parliament is a tenet of democracy, ...

Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities

Contextualizing language technologies beyond a single language kindled e...

Toward More Meaningful Resources for Lower-resourced Languages

In this position paper, we describe our perspective on how meaningful re...
