Style Over Substance: Evaluation Biases for Large Language Models

by Minghao Wu et al.

As large language models (LLMs) continue to advance, evaluating their performance accurately and comprehensively becomes increasingly challenging. Human evaluation is conventionally considered the gold standard in natural language generation, and recent work incorporates state-of-the-art LLMs as proxies for human judges in evaluation processes. Nonetheless, how capable humans and LLMs are as evaluators remains uncertain. This study investigates the behavior of both crowd-sourced human and LLM-based judges when comparing outputs from different models. To this end, we curate a dataset of intentionally flawed machine-generated answers. Our findings indicate that, despite the potentially greater danger posed by factual errors, answers containing factual errors were still rated more favorably than answers that were too short or contained grammatical errors, highlighting a concerning bias in the evaluation process. To address this issue, we propose evaluating machine-generated text independently along multiple dimensions, rather than merging all evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System. Empirical results reveal that this approach significantly improves the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, no notable improvement is observed in crowd-sourced evaluations, suggesting the need for further investigation and refinement.
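The core idea of a multi-dimensional Elo scheme can be illustrated with a short sketch: each model keeps a separate Elo rating per evaluation dimension, and each pairwise judgment updates only the corresponding dimension. The constants (K-factor, base rating) and dimension names below are illustrative assumptions, not the paper's exact configuration.

```python
K = 32          # update step size (assumed value)
BASE = 1000.0   # initial rating (assumed value)
DIMENSIONS = ["accuracy", "helpfulness", "language"]  # assumed dimensions

def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, model_a, model_b, outcomes):
    """Apply one pairwise comparison, one dimension at a time.

    `outcomes` maps each dimension to 1.0 (A wins), 0.0 (B wins),
    or 0.5 (tie), so each dimension's rating evolves independently.
    """
    for dim, score_a in outcomes.items():
        r_a = ratings[model_a][dim]
        r_b = ratings[model_b][dim]
        e_a = expected(r_a, r_b)
        ratings[model_a][dim] = r_a + K * (score_a - e_a)
        ratings[model_b][dim] = r_b + K * ((1.0 - score_a) - (1.0 - e_a))

# Two hypothetical models, rated separately on each dimension.
ratings = {m: {d: BASE for d in DIMENSIONS} for m in ("model_x", "model_y")}
# model_x wins on accuracy, loses on helpfulness, ties on language.
update(ratings, "model_x", "model_y",
       {"accuracy": 1.0, "helpfulness": 0.0, "language": 0.5})
```

Keeping the dimensions separate means a model that wins on fluency but loses on factuality ends up with a visibly split profile, rather than one aggregate score that hides the trade-off.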
