Analyzing Influential Factors in Human Preference Judgments via GPT-4

05/24/2023
by Yebowen Hu, et al.

Pairwise human judgments are pivotal in guiding large language models (LLMs) to generate outputs that align with human preferences. They are also often used in summarization evaluation, complementing existing automatic metrics. Despite their significance, there has been limited research probing these pairwise human judgments. The collective impact and relative weights of factors such as informativeness, coherence, fluency, and factual consistency remain elusive, as does the influence of hidden factors on the final judgment. In this paper, we conduct an in-depth examination of a dataset of pairwise human judgments released by OpenAI. Utilizing the Bradley-Terry-Luce model, we identify key factors that could potentially influence human judgments. Our research uncovers the inherent preferences embedded in human judgments and suggests strategies to boost sample efficiency. Finally, we provide insights on constructing balanced datasets for human judgment evaluations, a crucial step in shaping the behaviors of future LLMs.
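For intuition, fitting a Bradley-Terry-Luce model over per-pair factor ratings reduces to a logistic regression on the score differences between the two candidates: P(A preferred over B) = sigmoid(w · (x_A − x_B)), with no intercept so that swapping A and B flips the probability symmetrically. The sketch below is a minimal illustration of that reduction, not the authors' code; the factor scores, the "true" weights, and the simulated judgments are all synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-summary factor scores on a 1-5 scale:
# columns = informativeness, coherence, fluency, factual consistency.
rng = np.random.default_rng(0)
n_pairs = 500
x_a = rng.uniform(1, 5, size=(n_pairs, 4))  # scores for candidate A
x_b = rng.uniform(1, 5, size=(n_pairs, 4))  # scores for candidate B

# Simulate pairwise judgments from made-up "true" weights so the
# fit can be checked against a known answer.
true_w = np.array([1.5, 1.0, 0.3, 2.0])
p_a_wins = 1.0 / (1.0 + np.exp(-(x_a - x_b) @ true_w))
y = (rng.uniform(size=n_pairs) < p_a_wins).astype(int)  # 1 = A preferred

# BTL as logistic regression on score differences: no intercept,
# and a large C to approximate an unregularized maximum-likelihood fit.
model = LogisticRegression(fit_intercept=False, C=1e6)
model.fit(x_a - x_b, y)
print("estimated factor weights:", model.coef_.ravel())
```

With enough pairs the estimated weights approximately recover the synthetic ones; applied to real annotations, the relative magnitudes of such weights are one way to read off how much each factor sways the final judgment.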

Related research

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation (12/15/2022)
Human evaluation is the foundation upon which the evaluation of both sum...

Human-like Summarization Evaluation with ChatGPT (04/05/2023)
Evaluating text summarization is a challenging problem, and existing eva...

HIVE: Harnessing Human Feedback for Instructional Visual Editing (03/16/2023)
Incorporating human feedback has been shown to be crucial to align text ...

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations (07/06/2023)
Nowadays, the quality of responses generated by different modern large l...

Better Aligning Text-to-Image Models with Human Preference (03/25/2023)
Recent years have witnessed a rapid growth of deep generative models, wi...

Reinforcement Learning with Human Feedback for Realistic Traffic Simulation (09/01/2023)
In light of the challenges and costs of real-world testing, autonomous v...

Preference-grounded Token-level Guidance for Language Model Fine-tuning (06/01/2023)
Aligning language models (LMs) with preferences is an important problem ...
