GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

01/17/2021
by   Daniel Khashabi, et al.

Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, has so far been limited to tasks that can be reliably evaluated automatically. This work introduces GENIE, an extensible human evaluation leaderboard that brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms, asking human annotators to evaluate them along several axes (e.g., correctness, conciseness, fluency), and compares these human judgments with various automatic metrics. We introduce to GENIE several English datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.
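To make the workflow concrete, below is a minimal Python sketch of the kind of comparison the abstract describes: averaging per-axis human ratings for a submission and correlating them with an automatic metric. All names, rating scales, and numbers here are illustrative assumptions, not GENIE's actual API or data.

```python
from statistics import mean

# Example axes named in the abstract; ratings use a hypothetical 1-5 scale.
AXES = ("correctness", "conciseness", "fluency")

def aggregate_human_scores(ratings):
    """Average per-axis crowdworker ratings for one submission."""
    return {axis: mean(scores) for axis, scores in ratings.items()}

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative data: per-axis human ratings for four generated outputs,
# plus scores from an automatic metric (e.g., ROUGE) on the same outputs.
human_ratings = {
    "correctness": [4.2, 3.1, 4.8, 2.5],
    "conciseness": [3.9, 4.0, 4.5, 3.2],
    "fluency": [4.7, 4.1, 4.9, 3.8],
}
automatic_scores = [0.61, 0.42, 0.77, 0.35]

print("per-axis means:", aggregate_human_scores(human_ratings))
for axis in AXES:
    r = pearson(human_ratings[axis], automatic_scores)
    print(f"{axis} vs. automatic metric: r = {r:.3f}")
```

In a GENIE-style setup, a high correlation on an axis would suggest the automatic metric is a usable proxy for human judgment there; a low one marks an axis where human evaluation remains necessary.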

Related Research

06/26/2020

Evaluation of Text Generation: A Survey

The paper surveys evaluation methods of natural language generation (NLG...
06/17/2020

Automatically Ranked Russian Paraphrase Corpus for Text Generation

The article is focused on automatic development and ranking of a large c...
10/16/2021

FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation

Fast and reliable evaluation metrics are key to R&D progress. While tr...
04/11/2022

TRUE: Re-evaluating Factual Consistency Evaluation

Grounded text generation systems often generate text that contains factu...
04/11/2022

Toward More Effective Human Evaluation for Machine Translation

Improvements in text generation technologies such as machine translation...
06/21/2022

Automatic Pull Request Title Generation

Pull Requests (PRs) are a mechanism on modern collaborative coding platf...
05/05/2020

Russian Natural Language Generation: Creation of a Language Modelling Dataset and Evaluation with Modern Neural Architectures

Generating coherent, grammatically correct, and meaningful text is very ...

Code Repositories

evaluation-interfaces

Evaluation interfaces for generative models

