Large Language Models are not Fair Evaluators

05/29/2023
by Peiyi Wang, et al.

We uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as referees to score the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked simply by altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other; e.g., Vicuna could beat ChatGPT on 66 of 80 tested queries. To address this issue, we propose two simple yet effective calibration strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple detailed pieces of evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across both presentation orders to determine the final score. Extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. To facilitate future research on more robust large language model comparison, we integrate the techniques from the paper into an easy-to-use toolkit, FairEval, along with the human annotations. [<https://github.com/i-Eval/FairEval>]
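The Balanced Position Calibration idea described above can be sketched as follows. This is a minimal illustration, not the FairEval implementation: the `judge` callable standing in for an LLM evaluator, and the `biased_judge` toy below, are hypothetical.

```python
# Sketch of Balanced Position Calibration (BPC): score the same pair of
# responses under both presentation orders and average each response's
# scores, canceling out the evaluator's position bias.

def balanced_position_calibration(judge, resp_a, resp_b):
    """judge(first, second) -> (score_first, score_second).

    Returns position-balanced scores for resp_a and resp_b.
    """
    # Order 1: A shown first, B second.
    a1, b1 = judge(resp_a, resp_b)
    # Order 2: B shown first, A second.
    b2, a2 = judge(resp_b, resp_a)
    # Average each response's score over the two orders.
    return (a1 + a2) / 2, (b1 + b2) / 2

# Hypothetical toy judge with a positional bias: whichever response is
# shown first receives a +1 bonus on top of its intrinsic quality.
def biased_judge(first, second):
    base = {"good answer": 8, "ok answer": 6}
    return base[first] + 1, base[second]

score_a, score_b = balanced_position_calibration(
    biased_judge, "good answer", "ok answer"
)
print(score_a, score_b)  # 8.5 6.5: the +1 position bonus cancels out
```

Under either single order the biased judge inflates the first-shown response by one point; averaging the two orders recovers the underlying 8-vs-6 quality gap (shifted equally for both), which is the effect the paper's calibration targets.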

Related research:

- Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models (08/30/2023)
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models (05/19/2023)
- FineDeb: A Debiasing Framework for Language Models (02/05/2023)
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models (06/28/2023)
- Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models (09/21/2023)
- Mitigating Language Model Hallucination with Interactive Question-Knowledge Alignment (05/23/2023)
- Prompting GPT-3 To Be Reliable (10/17/2022)
