Learning Human Objectives by Evaluating Hypothetical Behavior

12/05/2019
by Siddharth Reddy, et al.

We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.
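The query loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: the dynamics model, initial-state generator, and simulated user are hand-written stand-ins, and every function name is hypothetical. A bootstrapped ensemble of ridge regressors replaces the neural network reward model, and random-shooting search replaces the trajectory optimizer. Ensemble disagreement is used as one possible proxy for the value of information.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([1.0, 1.0])  # hypothetical goal known only to the simulated user


def dynamics_model(state, action):
    # Stand-in for a learned forward dynamics model: a point moved by the action.
    return state + 0.1 * np.clip(action, -1.0, 1.0)


def sample_initial_state():
    # Stand-in for a generative model of initial states.
    return rng.uniform(-1.0, 1.0, size=2)


def simulated_user_reward(state):
    # Hypothetical simulated user: reward 1 near the goal, 0 elsewhere.
    return float(np.linalg.norm(state - GOAL) < 0.5)


def features(states):
    # Bias, linear, and squared terms for each 2D state.
    s = np.atleast_2d(states)
    return np.column_stack([np.ones(len(s)), s, s ** 2])


def fit_ensemble(X, y, n_models=5, ridge=1e-2):
    # Bootstrapped ridge regressors; their disagreement approximates uncertainty.
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        A = features(X[idx])
        w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y[idx])
        models.append(w)
    return np.array(models)


def rollout(state, actions):
    # Unroll the dynamics model only; the real environment is never touched.
    states = []
    for a in actions:
        state = dynamics_model(state, a)
        states.append(state)
    return np.array(states)


def synthesize_query(ensemble, horizon=10, n_candidates=64):
    # Random shooting: keep the hypothetical trajectory with the highest
    # summed variance of predicted rewards across ensemble members.
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        traj = rollout(sample_initial_state(), actions)
        preds = features(traj) @ ensemble.T  # shape (horizon, n_models)
        score = preds.var(axis=1).sum()
        if score > best_score:
            best, best_score = traj, score
    return best


def run_query_loop(n_rounds=20):
    # Seed with a few labeled states, then alternate: fit, synthesize, label.
    X = np.array([sample_initial_state() for _ in range(5)])
    y = np.array([simulated_user_reward(s) for s in X])
    for _ in range(n_rounds):
        ensemble = fit_ensemble(X, y)
        traj = synthesize_query(ensemble)
        labels = np.array([simulated_user_reward(s) for s in traj])
        X, y = np.vstack([X, traj]), np.concatenate([y, labels])
    return fit_ensemble(X, y), X, y
```

Calling `run_query_loop()` returns the final ensemble weights together with every labeled state. The point of the sketch is the structure of the loop: queries are synthesized from the models alone, and the user is consulted only to label them.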

research
12/30/2019

A New Framework for Query Efficient Active Imitation Learning

We seek to align agent policy with human expert behavior in a reinforcem...
research
01/29/2018

Learning the Reward Function for a Misspecified Model

In model-based reinforcement learning it is typical to treat the problem...
research
12/21/2020

Difference Rewards Policy Gradients

Policy gradient methods have become one of the most popular classes of a...
research
04/13/2021

Subgoal-based Reward Shaping to Improve Efficiency in Reinforcement Learning

Reinforcement learning, which acquires a policy maximizing long-term rew...
research
12/11/2019

SMiRL: Surprise Minimizing RL in Dynamic Environments

All living organisms struggle against the forces of nature to carve out ...
research
12/21/2020

Evaluating Agents without Rewards

Reinforcement learning has enabled agents to solve challenging tasks in ...
research
01/10/2018

Neural Program Synthesis with Priority Queue Training

We consider the task of program synthesis in the presence of a reward fu...
