The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

by Sebastian Gehrmann, et al.

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Because of this moving target, new models are often still evaluated on divergent Anglo-centric corpora with well-established but flawed metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of corpora and evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the initial release, for which we are organizing a shared task at our ACL 2021 Workshop and in which we invite the entire NLG community to participate.



Data-driven Natural Language Generation: Paving the Road to Success

We argue that there are currently two major bottlenecks to the commercia...

Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective

We address the fundamental challenge in Natural Language Generation (NLG...

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

A benchmark provides an ecosystem to measure the advancement of models w...

Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Automatic evaluation of language generation systems is a well-studied pr...

A Survey of Evaluation Metrics Used for NLG Systems

The success of Deep Learning has created a surge in interest in a wide a...

A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text

Java Code Generation consists in generating automatically Java code from...

Measuring Attribution in Natural Language Generation Models

With recent improvements in natural language generation (NLG) models for...