Data Driven Reward Initialization for Preference based Reinforcement Learning

02/17/2023
by Mudit Verma, et al.

Preference-based Reinforcement Learning (PbRL) methods use binary feedback from a human in the loop (HiL) over queried trajectory pairs to learn a reward model that approximates the human's underlying reward function and captures their preferences. In this work, we investigate the high variability of initialized reward models, which are sensitive to the random seed of the experiment; this variability compounds the problem of degenerate reward functions from which PbRL methods already suffer. We propose a data-driven reward initialization method that adds no extra cost to the human in the loop and only negligible cost to the PbRL agent. We show that it makes the predicted rewards of the initialized reward model uniform over the state space, which reduces the variability of the method's performance across runs and improves overall performance compared to other initialization methods.
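The abstract does not spell out the initialization objective, so the sketch below only illustrates the idea under stated assumptions: a PyTorch reward network, the standard Bradley-Terry-style preference loss that PbRL methods commonly apply to trajectory pairs, and a constant-target regression that pre-fits a freshly initialized reward model so its predicted rewards are uniform over states drawn from the agent's own rollouts, requiring no additional human feedback. The names `RewardModel`, `preference_loss`, and `uniform_reward_init` are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical reward network; the paper's architecture is an assumption here.
class RewardModel(nn.Module):
    def __init__(self, state_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        return self.net(state)

def preference_loss(model, seg_a, seg_b, pref):
    """Standard Bradley-Terry-style PbRL loss on one queried trajectory pair.

    seg_a, seg_b: (T, state_dim) trajectory segments; pref is 1.0 if the
    human preferred segment A, else 0.0 (the binary HiL feedback).
    """
    r_a = model(seg_a).sum()  # predicted return of segment A
    r_b = model(seg_b).sum()  # predicted return of segment B
    p_a = torch.sigmoid(r_a - r_b)  # P(A preferred over B)
    return nn.functional.binary_cross_entropy(p_a, torch.tensor(pref))

def uniform_reward_init(model, states, target=0.0, epochs=50, lr=1e-3):
    """Sketch of a data-driven initialization (assumption, not the paper's
    exact procedure): regress the randomly initialized model toward a single
    constant reward on observed states, so its predictions start out uniform
    over the visited state space before any preference queries are made.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    target_rewards = torch.full((states.shape[0], 1), target)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(states), target_rewards)
        loss.backward()
        opt.step()
    return model

# Usage with hypothetical dimensions: states come from the agent's initial
# rollouts, so no extra human feedback is needed for initialization.
model = uniform_reward_init(RewardModel(state_dim=17), torch.randn(1024, 17))
```

Because the initialization consumes only states the agent has already collected, it adds no queries for the human and only a brief pre-training pass for the agent, consistent with the cost claims in the abstract.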


