Preference Ranking Optimization for Human Alignment

06/30/2023
by Feifan Song, et al.

Large language models (LLMs) often produce misleading content, underscoring the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment by combining a reward model, typically based on the Bradley-Terry paired comparison, with an RL algorithm such as Proximal Policy Optimization (PPO) to optimize LLM responses. However, RLHF suffers from complexity, instability, and sensitivity to hyperparameters. In this paper, we propose Preference Ranking Optimization (PRO) as an alternative to PPO for directly aligning LLMs with the Bradley-Terry comparison. PRO extends the pairwise Bradley-Terry comparison to accommodate preference rankings of any length. By iteratively contrasting the likelihoods of generating responses, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of the n responses generated by the LLM with the human preference ranking over these responses. Experiments show that PRO outperforms existing alignment algorithms, achieving results comparable to ChatGPT and human responses under automatic, reward-based, GPT-4, and human evaluations. Furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences consistently enhance the performance of human alignment.
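To make the ranking objective concrete, the sketch below implements a Plackett-Luce-style ranking loss of the kind the abstract describes: given the log-probabilities a policy assigns to n candidate responses, ordered from most to least preferred by humans, it iteratively contrasts the best remaining response against all lower-ranked ones. The function name, the use of plain per-response log-probabilities, and the final averaging are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of a PRO-style ranking loss (not the authors' code).
# Input: log-probabilities the policy assigns to n candidate responses,
# ordered from most preferred (index 0) to least preferred (index n-1).
import torch

def pro_ranking_loss(response_logprobs: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce-style loss: at each prefix position k, push the
    k-th (best remaining) response above all lower-ranked responses."""
    n = response_logprobs.shape[0]
    loss = torch.zeros((), dtype=response_logprobs.dtype)
    for k in range(n - 1):
        # Contrast response k against responses k..n-1:
        # negative log-softmax over the remaining candidates, taken at index k.
        remaining = response_logprobs[k:]
        loss = loss - (remaining[0] - torch.logsumexp(remaining, dim=0))
    return loss / (n - 1)  # average over the n-1 contrasts (a normalization choice)

# Toy usage: three candidates, human-ranked best to worst.
logprobs = torch.tensor([-12.3, -14.1, -15.8], requires_grad=True)
loss = pro_ranking_loss(logprobs)
loss.backward()  # gradients raise the best response's likelihood relative to the rest
print(float(loss))
```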


Related research

RRHF: Rank Responses to Align Language Models with Human Feedback without tears (04/11/2023)
Statistical Rejection Sampling Improves Preference Optimization (09/13/2023)
Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length (08/10/2023)
Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles (03/07/2023)
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment (07/24/2023)
Reward Collapse in Aligning Large Language Models (05/28/2023)
Towards Boosting the Open-Domain Chatbot with Human Feedback (08/30/2022)
