Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems

07/24/2023
by Xiang Ji, et al.

A crucial task in decision-making problems is reward engineering. In practice, there is often no obvious choice of reward function. A popular approach is therefore to introduce human feedback during training and use it to learn a reward function. Among policy learning methods that use human feedback, preference-based methods have demonstrated substantial success in recent empirical applications such as InstructGPT. In this work, we develop a theory that provably shows the benefits of preference-based methods in offline contextual bandits. In particular, we improve the modeling and suboptimality analysis for running policy learning methods directly on human-scored samples. We then compare the resulting guarantees with the suboptimality guarantees of preference-based methods and show that preference-based methods enjoy lower suboptimality.

research
05/25/2023

Beyond Reward: Offline Preference-guided Policy Optimization

This study focuses on the topic of offline preference-based reinforcemen...
research
05/24/2023

Inverse Preference Learning: Preference-based RL without a Reward Function

Reward functions are difficult to design and often hard to align with hu...
research
11/12/2022

Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning

Preference-based reinforcement learning (RL) algorithms help avoid the p...
research
06/05/2022

Models of human preference for learning reward functions

The utility of reinforcement learning is limited by the alignment of rew...
research
08/08/2022

POLAR: Preference Optimization and Learning Algorithms for Robotics

Parameter tuning for robotic systems is a time-consuming and challenging...
research
03/02/2023

Active Reward Learning from Multiple Teachers

Reward learning algorithms utilize human feedback to infer a reward func...
research
03/18/2022

SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning

Preference-based reinforcement learning (RL) has shown potential for tea...
