Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings

08/03/2023
by   Veronika Hackl, et al.

This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4, a state-of-the-art artificial intelligence language model, across multiple iterations, time spans, and stylistic variations. The model rated responses to tasks in the Higher Education (HE) subject domain of macroeconomics with respect to both content and style. Statistical analysis was conducted to assess interrater reliability, the consistency of ratings across iterations, and the correlation between content and style ratings. The results reveal high interrater reliability, with ICC scores ranging from 0.94 to 0.99 across different time spans, suggesting that GPT-4 can generate consistent ratings across repetitions when given a clear prompt. Style and content ratings show a high correlation of 0.87. When an inadequate style was applied, average content ratings remained constant while style ratings decreased, indicating that the large language model (LLM) effectively distinguishes between these two criteria during evaluation. The prompt used in this study is also presented and explained. Further research is necessary to assess the robustness and reliability of AI models across various use cases.
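An ICC of this kind can be computed from a subjects-by-raters matrix of scores. The abstract does not state which ICC form the authors used, so the following is only a minimal sketch, assuming the two-way random-effects, absolute-agreement, single-rater form (Shrout & Fleiss ICC(2,1)), with repeated GPT-4 runs treated as "raters":

```python
import numpy as np

def icc2_1(ratings):
    """Shrout & Fleiss ICC(2,1): two-way random effects, absolute agreement,
    single rater. `ratings` is an (n_subjects, k_raters) array, where each
    column could be one GPT-4 rating run (an assumption, not the paper's setup).
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Partition the total sum of squares into subject, rater, and residual parts.
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)            # between-subjects mean square
    msc = ss_cols / (k - 1)            # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1)) # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two hypothetical rating runs over three responses: perfect agreement yields 1.0,
# while a constant offset between runs lowers the absolute-agreement ICC.
print(icc2_1([[1, 1], [2, 2], [3, 3]]))  # 1.0
print(icc2_1([[1, 2], [2, 3], [3, 4]]))  # below 1.0 due to the rater offset
```

Because ICC(2,1) measures absolute agreement, a systematic shift between runs reduces the coefficient even when the runs rank the responses identically; a consistency-type ICC would ignore such shifts.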

