GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

by Daniel Khashabi, et al.

Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency) and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.
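The workflow described above — collecting crowdsourced ratings on several axes per submission and comparing them against automatic metrics — can be sketched in a few lines. This is a hypothetical illustration, not GENIE's actual API: the function names, the annotation record format, and the use of Pearson correlation as the agreement measure are all assumptions for the sake of the example.

```python
# Hypothetical sketch of a GENIE-style aggregation step: average
# crowdsourced ratings per (system, axis) pair, then measure how well an
# automatic metric agrees with the human scores. All names here are
# illustrative, not part of GENIE itself.
from statistics import mean


def aggregate_ratings(annotations):
    """Average human ratings per (system, axis) pair.

    annotations: list of dicts with keys 'system', 'axis', 'rating'.
    Returns {system: {axis: mean_rating}}.
    """
    buckets = {}
    for a in annotations:
        buckets.setdefault(a["system"], {}) \
               .setdefault(a["axis"], []).append(a["rating"])
    return {sys_id: {axis: mean(rs) for axis, rs in axes.items()}
            for sys_id, axes in buckets.items()}


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists,
    e.g. human 'correctness' means vs. an automatic metric."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Toy annotations for two hypothetical submissions, rated on 1-5 scales.
example = [
    {"system": "A", "axis": "fluency", "rating": 4},
    {"system": "A", "axis": "fluency", "rating": 5},
    {"system": "A", "axis": "correctness", "rating": 3},
    {"system": "B", "axis": "fluency", "rating": 2},
    {"system": "B", "axis": "correctness", "rating": 4},
]
scores = aggregate_ratings(example)
# scores["A"]["fluency"] is the mean of [4, 5], i.e. 4.5
```

A real leaderboard would add annotator qualification, redundancy, and per-axis significance testing on top of this aggregation, but the core human-vs-metric comparison reduces to a correlation over per-system means like the one above.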



