Zero-shot NLG evaluation through Pairwise Comparisons with LLMs

07/15/2023
by Adian Liusie, et al.

Evaluating Natural Language Generation (NLG) outputs is crucial but laborious and expensive. While various automatic NLG assessment methods have been proposed, they are often task-specific and must be engineered with a particular domain and attribute in mind. In this work, we propose a robust zero-shot approach to NLG evaluation using pairwise comparative judgment with open-source Large Language Models (LLMs). The motivation for this approach is that, even for humans, it is easier to determine which of two options is better than it is to score each option objectively and independently. We leverage this insight together with the emergent abilities of LLMs: we probe FlanT5 to determine which of two candidate responses is better, rather than assigning absolute scores. Our results demonstrate that comparative assessment is more effective than absolute scoring, enabling smaller open-source LLMs to achieve performance comparable to that of larger public-access APIs. We evaluate systems on both summary evaluation and dialogue response generation, and show that open-source LLMs can yield good correlations with human scores across a range of attributes.
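To make the comparative setup concrete, below is a minimal sketch of how such a pairwise probe could be implemented with Flan-T5 via the HuggingFace transformers library. The prompt template, model size, and token-scoring details are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-large"  # illustrative choice; any Flan-T5 size works in principle
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def compare(context: str, candidate_a: str, candidate_b: str) -> str:
    """Return 'A' or 'B' for whichever candidate the model prefers."""
    # Hypothetical prompt wording; the paper's exact template may differ.
    prompt = (
        f"Passage: {context}\n"
        f"Summary A: {candidate_a}\n"
        f"Summary B: {candidate_b}\n"
        "Which summary is better, Summary A or Summary B?"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Score only the first decoder step: compare the logits the model
    # assigns to the tokens "A" and "B" rather than generating free text.
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    id_a = tokenizer("A", add_special_tokens=False).input_ids[0]
    id_b = tokenizer("B", add_special_tokens=False).input_ids[0]
    return "A" if logits[id_a] > logits[id_b] else "B"
```

In practice, one would run each comparison over both candidate orderings (A-B and B-A) to mitigate the positional bias that LLM judges are known to exhibit, and rank multiple systems by aggregating pairwise win counts.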

Related research

UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation (03/15/2023)
Large Language Models (LLMs) are popular for their impressive abilities,...

A Preliminary Evaluation of ChatGPT for Zero-shot Dialogue Understanding (04/09/2023)
Zero-shot dialogue understanding aims to enable dialogue to track the us...

ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding (05/23/2023)
We introduce ZeroSCROLLS, a zero-shot benchmark for natural language und...

ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback (10/22/2022)
Recently, dataset-generation-based zero-shot learning has shown promisin...

Designing Precise and Robust Dialogue Response Evaluators (04/10/2020)
Automatic dialogue response evaluator has been proposed as an alternativ...

Distilling ChatGPT for Explainable Automated Student Answer Assessment (05/22/2023)
Assessing student answers and providing valuable feedback is crucial for...

KoLA: Carefully Benchmarking World Knowledge of Large Language Models (06/15/2023)
The unprecedented performance of large language models (LLMs) necessitat...
