Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning

11/12/2022
by Katherine Metcalf, et al.

Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of hand-crafted reward functions by distilling them from human preference feedback, but they remain impractical due to the burdensome number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks. We hypothesize that REED-based methods better partition the state-action space and facilitate generalization to state-action pairs not included in the preference dataset. REED iterates between encoding environment dynamics in a state-action representation via a self-supervised temporal consistency task, and bootstrapping the preference-based reward function from the state-action representation. Whereas prior approaches train only on the preference-labelled trajectory pairs, REED exposes the state-action representation to all transitions experienced during policy training. We explore the benefits of REED within the PrefPPO [1] and PEBBLE [2] preference learning frameworks and demonstrate improvements in both the speed of policy learning and the final policy performance across experimental conditions. For example, on quadruped-walk and walker-walk with 50 preference labels, REED-based reward functions recover 83% and 64% of ground truth reward policy performance, whereas without REED only 38% and 21% are recovered. For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward.
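For concreteness, below is a minimal PyTorch sketch of the two objectives the abstract describes: a self-supervised temporal consistency loss that exposes a shared state-action representation to all experienced transitions, and a Bradley-Terry preference loss over trajectory segments, the standard reward-learning objective in PEBBLE/PrefPPO-style frameworks, which bootstraps the reward head from that representation. The names (StateActionEncoder, temporal_consistency_loss, preference_loss), the architecture, and the SimSiam-style negative-cosine objective are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionEncoder(nn.Module):
    """Illustrative state-action encoder; architecture details are assumptions."""
    def __init__(self, state_dim, action_dim, hidden_dim=256, rep_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, rep_dim),
        )
        # Predicts a representation of the next state from the current (s, a).
        self.dynamics_head = nn.Linear(rep_dim, rep_dim)
        self.state_proj = nn.Linear(state_dim, rep_dim)
        # Preference-based reward head bootstrapped from the shared representation.
        self.reward_head = nn.Linear(rep_dim, 1)

    def forward(self, state, action):
        return self.trunk(torch.cat([state, action], dim=-1))

def temporal_consistency_loss(encoder, state, action, next_state):
    """Self-supervised dynamics objective: the (s, a) representation should
    predict the next state's projection (negative cosine similarity)."""
    z = encoder(state, action)
    pred = encoder.dynamics_head(z)
    target = encoder.state_proj(next_state).detach()  # stop-gradient on target
    return -F.cosine_similarity(pred, target, dim=-1).mean()

def preference_loss(encoder, seg_a, seg_b, label):
    """Bradley-Terry loss over trajectory segments. seg_* = (states, actions)
    tensors of shape (batch, time, dim); label = 1 if segment A is preferred."""
    r_a = encoder.reward_head(encoder(*seg_a)).sum(dim=1).squeeze(-1)
    r_b = encoder.reward_head(encoder(*seg_b)).sum(dim=1).squeeze(-1)
    logits = torch.stack([r_a, r_b], dim=-1)
    return F.cross_entropy(logits, (1 - label).long())  # index 0 = A preferred
```

In this reading of the method, REED alternates the two updates: the consistency loss is applied to every transition the policy experiences, while the preference loss is fit only on the labelled segment pairs, so the shared representation generalizes beyond the preference dataset.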

Related research

09/06/2023 · Deep Reinforcement Learning from Hierarchical Weak Preference Feedback
Reward design is a fundamental, yet challenging aspect of practical rein...

10/17/2022 · Symbol Guided Hindsight Priors for Reward Learning from Human Preferences
Specifying rewards for reinforcement learned (RL) agents is challenging....

01/11/2023 · Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models
Preference-based reinforcement learning (PbRL) can enable robots to lear...

10/19/2022 · Scaling Laws for Reward Model Overoptimization
In reinforcement learning from human feedback, it is common to optimize ...

07/24/2023 · Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems
A crucial task in decision-making problems is reward engineering. It is ...

01/08/2023 · Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior
Many real-world reinforcement learning (RL) problems necessitate learnin...

06/06/2023 · Zero-shot Preference Learning for Offline RL via Optimal Transport
Preference-based Reinforcement Learning (PbRL) has demonstrated remarkab...
