A Study of Causal Confusion in Preference-Based Reward Learning

04/13/2022
by Jeremy Tien, et al.

Learning robot policies via preference-based reward learning is an increasingly popular method for customizing robot behavior. However, in recent years, there has been a growing body of anecdotal evidence that learning reward functions from preferences is prone to spurious correlations and reward gaming or hacking behaviors. While there is much anecdotal, empirical, and theoretical analysis of causal confusion and reward gaming behaviors both in reinforcement learning and imitation learning approaches that directly map from states to actions, we provide the first systematic study of causal confusion in the context of learning reward functions from preferences. To facilitate this study, we identify a set of three preference learning benchmark domains where we observe causal confusion when learning from offline datasets of pairwise trajectory preferences: a simple reacher domain, an assistive feeding domain, and an itch-scratching domain. To gain insight into this observed causal confusion, we present a sensitivity analysis that explores the effect of different factors, including the type of training data, reward model capacity, and feature dimensionality, on the robustness of rewards learned from preferences. We find evidence that learning rewards from pairwise trajectory preferences is highly sensitive and non-robust to spurious features and increasing model capacity, but not as sensitive to the type of training data. Videos, code, and supplemental results are available at https://sites.google.com/view/causal-reward-confusion.
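
For context, learning a reward function from pairwise trajectory preferences is typically framed as binary classification over trajectory returns under a Bradley-Terry-style objective. The sketch below illustrates that setup in PyTorch; it is not the authors' code, and names such as RewardNet, preference_loss, traj_a, traj_b, and pref are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation) of reward
# learning from an offline dataset of pairwise trajectory preferences.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a single state (or feature vector) to a scalar reward."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):           # states: (T, obs_dim)
        return self.net(states).sum()    # trajectory return = sum of per-state rewards

def preference_loss(reward_net, traj_a, traj_b, pref):
    """Bradley-Terry cross-entropy: pref = 1 if trajectory A is preferred, 0 if B is."""
    ret_a = reward_net(traj_a)
    ret_b = reward_net(traj_b)
    logits = torch.stack([ret_a, ret_b])               # (2,)
    target = torch.tensor(0 if pref == 1 else 1)       # index of preferred trajectory
    return nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# Usage on an offline dataset of (traj_a, traj_b, pref) tuples:
# net = RewardNet(obs_dim=8)
# opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# for traj_a, traj_b, pref in dataset:
#     loss = preference_loss(net, traj_a, traj_b, pref)
#     opt.zero_grad(); loss.backward(); opt.step()
```

Because this objective only requires the learned reward to rank the preferred trajectory above the other, any spurious feature that happens to correlate with the preference labels can achieve low training loss, which is the kind of causal confusion the paper studies.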


Related research

Preference Transformer: Modeling Human Preferences using Transformers for RL (03/02/2023)
Preference-based reinforcement learning (RL) provides a framework to tra...

Everyone Deserves A Reward: Learning Customized Human Preferences (09/06/2023)
Reward models (RMs) are crucial in aligning large language models (LLMs)...

On The Fragility of Learned Reward Functions (01/09/2023)
Reward functions are notoriously difficult to specify, especially for ta...

Is RLHF More Difficult than Standard RL? (06/25/2023)
Reinforcement learning from Human Feedback (RLHF) learns from preference...

Offline Preference-Based Apprenticeship Learning (07/20/2021)
We study how an offline dataset of prior (possibly random) experience ca...

Active Reward Learning from Online Preferences (02/27/2023)
Robot policies need to adapt to human preferences and/or new environment...

Learning Reward Machines through Preference Queries over Sequences (08/18/2023)
Reward machines have shown great promise at capturing non-Markovian rewa...
