Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification

03/23/2021
by Benjamin Eysenbach et al.

In the standard Markov decision process formalism, users specify tasks by writing down a reward function. However, in many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved. Motivated by this observation, we derive a control algorithm from first principles that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states. Prior work has approached similar problem settings in a two-stage process, first learning an auxiliary reward function and then optimizing this reward function using another reinforcement learning algorithm. In contrast, we derive a method based on recursive classification that eschews auxiliary reward functions and instead directly learns a value function from transitions and successful outcomes. Our method therefore requires fewer hyperparameters to tune and lines of code to debug. We show that our method satisfies a new data-driven Bellman equation, where examples take the place of the typical reward function term. Experiments show that our approach outperforms prior methods that learn explicit reward functions.
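The core idea — a Bellman-style recursion in which success examples replace the reward term — can be illustrated with a minimal tabular sketch. This is not the paper's algorithm (which trains a classifier from transitions and example states with function approximation); it is a simplified, assumption-laden toy in which the dynamics, the chain MDP, and the backup form are all invented for illustration. Here a classifier value C(s) estimates the probability of reaching a user-provided success example, and the backup uses an example indicator where a reward would normally appear:

```python
import numpy as np

# Hypothetical toy setup (not from the paper): a 5-state chain where the
# agent deterministically moves right, and state 4 is the only state the
# user has provided as a "success example".
n_states = 5
gamma = 0.9
is_example = np.zeros(n_states)
is_example[4] = 1.0                    # user-provided success example
next_state = [1, 2, 3, 4, 4]           # move-right dynamics; state 4 absorbs

# Recursively fit C(s), an estimate of the probability of reaching a
# success example from s, via an example-based backup in which the
# example indicator stands in for the usual reward term:
#   C(s) <- (1 - gamma) * [s is an example] + gamma * C(s')
C = np.zeros(n_states)
for _ in range(200):
    C = (1 - gamma) * is_example + gamma * C[next_state]

print(np.round(C, 3))  # states closer to the example score higher
```

The fixed point ramps up geometrically toward the example state (here roughly 0.656, 0.729, 0.81, 0.9, 1.0), mirroring how the learned value function in the paper directs the policy toward states with a high probability of leading to successful outcomes — no separately learned reward function is involved.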
