Can Large Language Models Be an Alternative to Human Evaluations?

05/03/2023
by Cheng-Han Chiang, et al.

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce, and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore whether this ability of LLMs can serve as an alternative to human evaluation. We present the LLMs with exactly the same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to assess the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable across different formattings of the task instructions and the sampling algorithms used to generate the answers. We are the first to show the potential of using LLMs to assess the quality of texts, and we discuss the limitations and ethical considerations of LLM evaluation.
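To make the protocol concrete, here is a minimal sketch of what a single "LLM evaluation" query could look like in Python. The prompt wording, the attribute names, and the `ask` callback are illustrative assumptions rather than the paper's exact materials; in practice `ask` would wrap whatever instruction-following LLM API is available.

```python
import re
from typing import Callable

# Illustrative questionnaire prompt: the same instructions, sample, and
# question a human rater would see, asking for a 1-5 Likert rating.
PROMPT_TEMPLATE = (
    "You will be shown a short story. Read it, then answer the question "
    "with a single number from 1 (lowest) to 5 (highest).\n\n"
    "Story:\n{sample}\n\n"
    "Question: How {attribute} is the story?"
)

def llm_rating(sample: str, attribute: str, ask: Callable[[str], str]) -> int:
    """Build the questionnaire prompt, send it to the LLM via `ask`,
    and parse the first digit in 1-5 out of the free-form reply."""
    reply = ask(PROMPT_TEMPLATE.format(sample=sample, attribute=attribute))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"no rating found in reply: {reply!r}")
    return int(match.group())

if __name__ == "__main__":
    # Stub standing in for a real LLM API call; always answers "4".
    stub = lambda prompt: "I would rate it 4 out of 5."
    print(llm_rating("Once upon a time, ...", "coherent", ask=stub))  # -> 4
```

Because the reply is free-form text, the score has to be extracted from it; the regex above is one simple choice, and averaging several sampled replies per question is a natural extension given the paper's finding that results are stable across sampling algorithms.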


