NusaCrowd: Open Source Initiative for Indonesian NLP Resources

12/19/2022
by Samuel Cahyawijaya, et al.

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.

research
03/30/2021

Collaborative construction of lexicographic and parallel datasets for African languages: first assessment

Faced with a considerable lack of resources in African languages to carr...
research
07/21/2022

NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

At the center of the underlying issues that halt Indonesian natural lang...
research
07/14/2022

Open Terminology Management and Sharing Toolkit for Federation of Terminology Databases

Consolidated access to current and reliable terms from different subject...
research
08/24/2022

IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

A cornerstone in AI research has been the creation and adoption of stand...
research
11/28/2022

Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources

While the NLP community is generally aware of resource disparities among...
research
05/24/2023

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

Despite the major advances in NLP, significant disparities in NLP system...
research
07/23/2020

AI4D – African Language Dataset Challenge

As language and speech technologies become more advanced, the lack of fu...
