INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback

05/23/2023
by Wenda Xu, et al.

The field of automatic evaluation of text generation has made tremendous progress in the last few years. In particular, since the advent of neural metrics such as COMET, BLEURT, and SEScore2, the newest generation of metrics shows a high correlation with human judgment. Unfortunately, quality scores generated with neural metrics are not interpretable, and it is unclear which part of the generated output the metrics criticize. To address this limitation, we present INSTRUCTSCORE, an open-source, explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT4, we fine-tune a LLAMA model to create an evaluative metric that can produce a diagnostic report aligned with human judgment. We evaluate INSTRUCTSCORE on the WMT22 Zh-En translation task, where our 7B model surpasses other LLM-based baselines, including those based on 175B GPT3. Impressively, INSTRUCTSCORE, even without direct supervision from human-rated data, achieves performance on par with state-of-the-art metrics like COMET22, which was fine-tuned on human ratings.

research
04/30/2020

NUBIA: NeUral Based Interchangeability Assessor for Text Generation

We present NUBIA, a methodology to build automatic evaluation metrics fo...
research
03/05/2020

RecipeGPT: Generative Pre-training Based Cooking Recipe Generation and Evaluation System

Interests in the automatic generation of cooking recipes have been growi...
research
08/07/2020

Perception Score, A Learned Metric for Open-ended Text Generation Evaluation

Automatic evaluation for open-ended natural language generation tasks re...
research
08/23/2021

CGEMs: A Metric Model for Automatic Code Generation using GPT-3

Today, AI technology is showing its strengths in almost every industry a...
research
05/22/2023

GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language

One of the essential human skills is the ability to seamlessly build an ...
research
04/09/2020

BLEURT: Learning Robust Metrics for Text Generation

Text generation has made significant advances in the last few years. Yet...
research
04/11/2022

TRUE: Re-evaluating Factual Consistency Evaluation

Grounded text generation systems often generate text that contains factu...
