Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

01/25/2022
by Angelina McMillan-Major, et al.

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.

