Inverse Reward Design

11/08/2017
by Dylan Hadfield-Menell, et al.

Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.
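To make the two steps in the abstract concrete (infer the true objective from the designed reward and the training MDP, then plan risk-aversely in a test MDP), here is a minimal sketch on a toy problem. Everything in it is an illustrative assumption rather than something stated above: the three terrain features, the trajectory feature counts, the finite grid of candidate rewards, the uniform prior, and the value of `BETA`. The likelihood assumes the designer picks a proxy reward whose induced training behavior is approximately optimal for the true reward, and the "plausible set" threshold is a deterministic stand-in for drawing samples from the posterior.

```python
import itertools
import math

# Feature order: (dirt, grass, lava). Each trajectory is summarized by how
# many cells of each type it crosses. All numbers here are made up.
TRAIN_TRAJS = [(4, 0, 0), (2, 1, 0)]   # training MDP: no lava anywhere
TEST_TRAJS = [(1, 0, 1), (4, 0, 0)]    # test MDP: a lava shortcut appears

PROXY = (-1, -5, 0)                    # designed reward; lava was never considered
CANDIDATES = list(itertools.product((-5, -1, 0), repeat=3))  # hypothetical true rewards
BETA = 1.0                             # assumed designer "rationality"


def ret(w, traj):
    """Linear return w . phi(traj)."""
    return sum(wi * fi for wi, fi in zip(w, traj))


def plan(w, trajs):
    """Index of the trajectory that maximizes return under w (first on ties)."""
    return max(range(len(trajs)), key=lambda i: ret(w, trajs[i]))


def ird_posterior(proxy, candidates, train_trajs, beta):
    """P(true reward | proxy, training MDP) under a uniform prior."""
    # Behavior that each possible proxy would induce in the training MDP.
    induced = [train_trajs[plan(w, train_trajs)] for w in candidates]
    observed = train_trajs[plan(proxy, train_trajs)]
    scores = []
    for w_true in candidates:
        # Normalizer: how well every alternative proxy would have done under w_true.
        z = sum(math.exp(beta * ret(w_true, phi)) for phi in induced)
        scores.append(math.exp(beta * ret(w_true, observed)) / z)
    total = sum(scores)
    return [s / total for s in scores]


def risk_averse_plan(posterior, candidates, test_trajs):
    """Maximize the worst-case return over rewards the posterior still favors."""
    prior = 1.0 / len(candidates)
    plausible = [w for w, p in zip(candidates, posterior) if p >= prior]
    return max(range(len(test_trajs)),
               key=lambda i: min(ret(w, test_trajs[i]) for w in plausible))


posterior = ird_posterior(PROXY, CANDIDATES, TRAIN_TRAJS, BETA)
proxy_choice = plan(PROXY, TEST_TRAJS)
ird_choice = risk_averse_plan(posterior, CANDIDATES, TEST_TRAJS)
```

In this toy setup the training trajectories never touch lava, so the posterior leaves the lava weight as uncertain as the prior. Optimizing the proxy literally takes the lava shortcut in the test MDP, while the risk-averse planner stays on the longer dirt path.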


