Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

10/26/2022
by Colin Leong, et al.

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. For each of the included downstream tasks, these datasets are either the most multilingual available or among the most multilingual. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.


research
09/13/2023

Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish

Understanding procedural natural language (e.g., step-by-step instructio...
research
01/14/2022

A Warm Start and a Clean Crawled Corpus – A Recipe for Good Language Models

We train several language models for Icelandic, including IceBERT, that ...
research
02/15/2023

Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

Large multilingual models have inspired a new class of word alignment me...
research
11/03/2020

Towards Code-switched Classification Exploiting Constituent Language Resources

Code-switching is a commonly observed communicative phenomenon denoting ...
research
03/02/2021

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

The milestone improvements brought about by deep representation learning...
research
06/27/2022

Endowing Language Models with Multimodal Knowledge Graph Representations

We propose a method to make natural language understanding models more p...
research
03/15/2022

Does Corpus Quality Really Matter for Low-Resource Languages?

The vast majority of non-English corpora are derived from automatically ...
