Style Over Substance: Evaluation Biases for Large Language Models

07/06/2023
by Minghao Wu, et al.

As large language models (LLMs) continue to advance, evaluating their performance accurately and comprehensively becomes increasingly challenging. Human evaluation is conventionally regarded as the gold standard in natural language generation, and recent work has adopted state-of-the-art LLMs as proxies for human judges. Nonetheless, how capable humans and LLMs actually are as evaluators remains uncertain. This study investigates the behavior of both crowd-sourced human and LLM-based judges when comparing outputs from different models. To do so, we curate a dataset of intentionally flawed machine-generated answers. Our findings show that, despite factual errors arguably being the more serious flaw, answers containing factual errors were still rated more favorably than answers that were too short or contained grammatical errors, revealing a concerning bias in the evaluation process. To address this issue, we propose evaluating machine-generated text independently along multiple dimensions rather than merging all evaluation aspects into a single score. We instantiate this idea with the Elo rating system, yielding the Multi-Elo Rating System. Empirical results show that this approach significantly improves the quality of LLM-based evaluations, particularly with respect to factual accuracy. No comparable improvement is observed in crowd-sourced human evaluations, however, suggesting the need for further investigation and refinement.
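
To make the multi-dimensional rating idea concrete, here is a minimal sketch of maintaining an independent Elo rating per evaluation dimension. It uses standard Elo update rules (K = 32, base rating 1000) and illustrative dimension names; the class name, parameters, and dimensions are assumptions for exposition, not the paper's reference implementation.

```python
from collections import defaultdict

K = 32      # standard Elo update factor (assumed, not taken from the paper)
BASE = 1000  # initial rating for every model on every dimension


class MultiEloRater:
    """Sketch: one independent Elo rating per evaluation dimension."""

    def __init__(self, dimensions=("accuracy", "helpfulness", "language")):
        self.dimensions = dimensions
        # ratings[dimension][model] -> current Elo score
        self.ratings = {d: defaultdict(lambda: BASE) for d in dimensions}

    @staticmethod
    def _expected(r_a, r_b):
        # Probability that A beats B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def update(self, model_a, model_b, outcomes):
        """outcomes maps dimension -> 1.0 (A wins), 0.5 (tie), 0.0 (B wins)."""
        for dim, score_a in outcomes.items():
            r_a = self.ratings[dim][model_a]
            r_b = self.ratings[dim][model_b]
            e_a = self._expected(r_a, r_b)
            # Zero-sum update applied separately on each dimension.
            self.ratings[dim][model_a] = r_a + K * (score_a - e_a)
            self.ratings[dim][model_b] = r_b + K * ((1.0 - score_a) - (1.0 - e_a))


# Example: a judge prefers model A on accuracy and helpfulness, ties on language.
rater = MultiEloRater()
rater.update("model_a", "model_b",
             {"accuracy": 1.0, "helpfulness": 1.0, "language": 0.5})
print({d: dict(r) for d, r in rater.ratings.items()})
```

Keeping the dimensions separate means a model that writes fluent but factually wrong answers cannot recover its accuracy rating through its language rating, which is the failure mode a single aggregated score invites.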


