IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

by   Samuel Cahyawijaya, et al.

A benchmark provides an ecosystem to measure the advancement of models with standard datasets and automatic and human evaluation metrics. We introduce IndoNLG, the first such benchmark for natural language generation (NLG) in the Indonesian language. It covers six tasks: summarization, question answering, open chitchat, and three machine translation tasks covering different language pairs. We provide a large, clean pre-training corpus of Indonesian, Sundanese, and Javanese data called Indo4B-Plus, which we use to train our pre-trained NLG model, IndoBART. We evaluate the effectiveness and efficiency of IndoBART through extensive evaluation on all IndoNLG tasks. Our findings show that IndoBART achieves competitive performance on Indonesian tasks with five times fewer parameters than the largest multilingual model in our benchmark, mBART-LARGE (Liu et al., 2020), and almost 4x faster inference on CPU and 2.5x on GPU. We additionally demonstrate IndoBART's ability to learn Javanese and Sundanese, achieving decent performance on the corresponding machine translation tasks.




