On the Sensitivity of Reward Inference to Misspecified Human Models

12/09/2022
by   Joey Hong, et al.

Inferring reward functions from human behavior is at the center of value alignment: aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This raises the question: how accurate do these models need to be in order for reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic errors in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if we can guarantee that reward accuracy improves as our models improve, this would demonstrate the benefit of further work on the modeling side. We study this question both theoretically and empirically. We show that it is, unfortunately, possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we also identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.
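To make the sensitivity question concrete, here is a minimal, hypothetical sketch (not the paper's construction): reward inference in a single-state choice problem under a Boltzmann-rational human model, where the robot misspecifies the rationality coefficient beta. All function names, parameters, and numbers below are illustrative assumptions.

```python
# Illustrative sketch: how a misspecified rationality coefficient in a
# Boltzmann human model distorts the inferred reward. Hypothetical setup,
# not the construction used in the paper.
import numpy as np

def boltzmann_policy(reward, beta):
    """Softmax choice probabilities for a Boltzmann-rational human."""
    logits = beta * reward
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def infer_reward(choice_probs, assumed_beta):
    """Invert the assumed Boltzmann model: r_hat_i = log p_i / beta (up to a shift)."""
    r_hat = np.log(choice_probs) / assumed_beta
    return r_hat - r_hat.mean()       # rewards are identifiable only up to a constant

r_true = np.array([1.0, 0.0, -1.0])   # ground-truth reward over 3 actions
beta_true = 2.0                       # the human's actual rationality
p_observed = boltzmann_policy(r_true, beta_true)

for beta_assumed in [2.0, 1.0, 0.5]:  # increasingly misspecified human models
    r_hat = infer_reward(p_observed, beta_assumed)
    model_gap = abs(beta_assumed - beta_true)
    reward_err = np.linalg.norm(r_hat - (r_true - r_true.mean()))
    print(f"assumed beta={beta_assumed}: model gap={model_gap:.1f}, "
          f"reward error={reward_err:.2f}")
```

In this toy example the inferred reward is scaled by beta_true / beta_assumed, so the inference error grows smoothly with the misspecification of beta; the adversarial constructions discussed in the abstract show that other, less benign model errors can make the inferred reward arbitrarily wrong.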


Related research

Human irrationality: both bad and good for reward inference (11/12/2021)
Assuming humans are (approximately) rational enables robots to infer rew...

Inverse Reward Design (11/08/2017)
Autonomous agents optimize the reward function we give them. What they d...

Positive-Unlabeled Reward Learning (11/01/2019)
Learning reward functions from data is a promising path towards achievin...

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference (06/23/2019)
Our goal is for agents to optimize the right reward function, despite ho...

Model-agnostic Fits for Understanding Information Seeking Patterns in Humans (12/09/2020)
In decision making tasks under uncertainty, humans display characteristi...

Reward Learning with Intractable Normalizing Functions (05/16/2023)
Robots can learn to imitate humans by inferring what the human is optimi...

The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models (04/22/2022)
Models of human behavior for prediction and collaboration tend to fall i...
