Active Inverse Reward Design

09/09/2018
by Sören Mindermann et al.

Reward design, the problem of selecting an appropriate reward function for an AI system, is both critically important, as it encodes the task the system should perform, and challenging, as it requires reasoning about and understanding the agent's environment in detail. AI practitioners often iterate on the reward functions for their systems in a trial-and-error process to obtain the desired behavior. Inverse reward design (IRD) is a preference inference method that infers a true reward function from an observed, possibly misspecified, proxy reward function. This allows the system to determine when it can trust the observed reward function and to respond appropriately when it cannot. IRD has been shown to mitigate reward design problems such as negative side effects (omitting a seemingly irrelevant but important aspect of the task) and reward hacking (exploiting unanticipated loopholes). In this paper, we propose actively selecting the set of proxy reward functions available to the designer, which improves the quality of inference and simplifies the associated reward design problem. We present two types of queries: discrete queries, where the designer chooses from a discrete set of reward functions, and feature queries, where the system asks the designer for weights on a small set of features. We evaluate this approach with experiments in a personal shopping assistant domain and a 2D navigation domain, and find that it yields lower regret at test time than vanilla IRD. Our results indicate that actively selecting the set of available reward functions is a promising direction for improving the efficiency and effectiveness of reward design.
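To make the discrete-query idea concrete, here is a minimal Python sketch. It is not code from the paper: the toy linear-reward setting, all names, and the softmax designer model are illustrative assumptions. The sketch greedily picks the discrete query, a small set of candidate proxy rewards, that minimizes the expected posterior entropy over the true reward, under an IRD-style observation model in which the designer tends to pick the proxy whose induced behavior scores best under the true reward.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (illustrative, not the paper's): rewards are linear in d features.
d = 3
candidates = rng.normal(size=(20, d))  # hypothesis set for the true reward w*
prior = np.full(len(candidates), 1.0 / len(candidates))
trajs = rng.normal(size=(8, d))        # feature counts of available trajectories

def true_return(proxy, w_true):
    """Plan with the proxy (pick its best trajectory), then score under w*."""
    chosen = np.argmax(trajs @ proxy)
    return trajs[chosen] @ w_true

def answer_probs(query, w_true, beta=5.0):
    """IRD-style observation model: the designer noisily picks the proxy in
    the query whose induced behavior does best under the true reward."""
    scores = np.array([true_return(p, w_true) for p in query])
    e = np.exp(beta * (scores - scores.max()))  # numerically stable softmax
    return e / e.sum()

def expected_posterior_entropy(query):
    """Average entropy of the posterior over w* after seeing the answer."""
    L = np.array([answer_probs(query, w) for w in candidates])  # (n, |query|)
    p_answer = prior @ L
    h = 0.0
    for a, pa in enumerate(p_answer):
        post = prior * L[:, a]
        post /= post.sum()
        h -= pa * np.sum(post * np.log(post + 1e-12))
    return h

# Active query selection: show the designer the size-k subset of the proxy
# pool that is expected to be most informative about the true reward.
proxy_pool = rng.normal(size=(6, d))
k = 2
best_query = min(itertools.combinations(range(len(proxy_pool)), k),
                 key=lambda idx: expected_posterior_entropy(proxy_pool[list(idx)]))
print("proxies to show the designer:", best_query)
```

A feature query could be handled the same way by treating each possible weight setting on the queried features as a candidate answer; the paper's exact acquisition function and observation model may differ from this entropy-based sketch.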

Related research

Inverse Reward Design (11/08/2017)
Autonomous agents optimize the reward function we give them. What they d...

Pitfalls of learning a reward function online (04/28/2020)
In some agent designs like inverse reinforcement learning an agent needs...

Challenges for Using Impact Regularizers to Avoid Negative Side Effects (01/29/2021)
Designing reward functions for reinforcement learning is difficult: besi...

Defining and Characterizing Reward Hacking (09/27/2022)
We provide the first formal definition of reward hacking, a phenomenon w...

Simplifying Reward Design through Divide-and-Conquer (06/07/2018)
Designing a good reward function is essential to robot planning and rein...

Assisted Robust Reward Design (11/18/2021)
Real-world robotic tasks require complex reward functions. When we defin...

Discovering Generalizable Spatial Goal Representations via Graph-based Active Reward Learning (11/24/2022)
In this work, we consider one-shot imitation learning for object rearran...
