IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

09/11/2020
by   Bryan Wilie, et al.

Although Indonesian is the fourth most widely used language on the internet, research progress on the language in natural language processing (NLP) has been slow due to a lack of available resources. In response, we introduce the first large-scale resource for training, evaluating, and benchmarking Indonesian natural language understanding (IndoNLU) tasks. IndoNLU comprises twelve tasks of varying complexity, ranging from single-sentence classification to sentence-pair sequence labeling. The datasets span different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained on Indo4B, a large and clean Indonesian corpus collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, along with a framework for benchmark evaluation, enabling anyone to benchmark their system's performance.


