GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

01/17/2021
by Daniel Khashabi, et al.

Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard that brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms, asking human annotators to evaluate them along several axes (e.g., correctness, conciseness, fluency), and compares their judgments with automatic metrics. We introduce several English datasets to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as in their automatic and manual evaluation.
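
The abstract describes the basic workflow: a submitted model's outputs are routed to crowdworkers, who rate each output along several axes, and those ratings are aggregated into per-axis leaderboard scores that can then be compared against automatic metrics. The Python sketch below illustrates only that aggregation step; the Submission and HumanJudgment schemas, the 1-5 rating scale, and the simple per-axis averaging are assumptions made for illustration, not GENIE's actual API or scoring procedure.

from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Axes the paper mentions as examples of human-evaluation dimensions.
AXES = ("correctness", "conciseness", "fluency")

@dataclass
class Submission:
    """A leaderboard submission: one generated output per test instance (hypothetical schema)."""
    model_name: str
    task: str                    # e.g. "summarization" or "machine translation"
    predictions: Dict[str, str]  # instance_id -> generated text

@dataclass
class HumanJudgment:
    """One crowdworker's ratings for a single prediction (assumed 1-5 scale)."""
    instance_id: str
    ratings: Dict[str, int]      # axis -> score

def aggregate(judgments: List[HumanJudgment]) -> Dict[str, float]:
    """Average crowdworker ratings per axis into per-axis leaderboard scores."""
    per_axis: Dict[str, List[int]] = {axis: [] for axis in AXES}
    for judgment in judgments:
        for axis, score in judgment.ratings.items():
            if axis in per_axis:
                per_axis[axis].append(score)
    return {axis: mean(scores) for axis, scores in per_axis.items() if scores}

if __name__ == "__main__":
    judgments = [
        HumanJudgment("q1", {"correctness": 5, "conciseness": 4, "fluency": 5}),
        HumanJudgment("q1", {"correctness": 4, "conciseness": 4, "fluency": 5}),
        HumanJudgment("q2", {"correctness": 3, "conciseness": 5, "fluency": 4}),
    ]
    print(aggregate(judgments))  # per-axis means, e.g. correctness = 4.0

In practice the per-axis human scores would be reported alongside automatic metrics for the same submission, which is what enables the comparison the paper describes.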


Code Repositories

evaluation-interfaces

Evaluation interfaces for generative models


