Is ChatGPT a Good NLG Evaluator? A Preliminary Study

03/07/2023
by   Jiaan Wang, et al.

Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, ChatGPT's ability to serve as an evaluation metric itself remains underexplored. Since assessing the quality of NLG models is an arduous task and previous statistical metrics are notorious for their poor correlation with human judgments, we ask whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to assess its reliability as an NLG metric. Specifically, we treat ChatGPT as a human evaluator and give it task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions, prompting it to score the outputs of NLG models. We conduct experiments on three widely used NLG meta-evaluation datasets (covering summarization, story generation, and data-to-text tasks). Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with gold human judgments. We hope this preliminary study prompts the emergence of a general-purpose, reliable NLG metric.
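The abstract describes prompting ChatGPT with task- and aspect-specific instructions so it scores a generated text like a human evaluator. A minimal sketch of that prompting idea is below; the template wording, the 1–10 scale, and the function name are illustrative assumptions, not the paper's exact templates.

```python
def build_eval_prompt(task: str, aspect: str, source: str, output: str,
                      scale: int = 10) -> str:
    """Compose a task-specific and aspect-specific instruction asking an
    LLM to score a generated text, mimicking a human evaluator.

    Note: this template is a guess at the general recipe, not the
    paper's actual prompt.
    """
    return (
        f"Score the following {task} output with respect to {aspect} "
        f"on a scale of 1 to {scale}, where {scale} is best.\n\n"
        f"Source:\n{source}\n\n"
        f"Generated text:\n{output}\n\n"
        f"{aspect.capitalize()} score (1-{scale}):"
    )


# Example: an aspect-specific prompt for summarization relevance.
prompt = build_eval_prompt(
    task="summarization",
    aspect="relevance",
    source="The city council approved the new budget on Tuesday after "
           "a lengthy debate over school funding.",
    output="The council passed the budget.",
)
print(prompt)
```

In a real meta-evaluation, a prompt like this would be sent to the ChatGPT API, the numeric score parsed from the reply, and the scores correlated (e.g., via Spearman's rho) against the dataset's human judgments.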


