Reward-Conditioned Policies

12/31/2019
by   Aviral Kumar, et al.
0

Reinforcement learning offers the promise of automating the acquisition of complex behavioral skills. However, compared to commonly used and well-understood supervised learning methods, reinforcement learning algorithms can be brittle, difficult to use and tune, and sensitive to seemingly innocuous implementation decisions. In contrast, imitation learning utilizes standard and well-understood supervised learning methods, but requires near-optimal expert data. Can we learn effective policies via supervised learning without demonstrations? The main idea that we explore in this work is that non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory. By then conditioning the policy on the numerical value of the reward, we can obtain a policy that generalizes to larger returns. We show how such an approach can be derived as a principled method for policy search, discuss several variants, and compare the method experimentally to a variety of current reinforcement learning methods on standard benchmarks.

READ FULL TEXT
research
12/12/2019

Learning To Reach Goals Without Reinforcement Learning

Imitation learning algorithms provide a simple and straightforward appro...
research
05/19/2022

IL-flOw: Imitation Learning from Observation using Normalizing Flows

We present an algorithm for Inverse Reinforcement Learning (IRL) from ex...
research
12/08/2021

Application of Deep Reinforcement Learning to Payment Fraud

The large variety of digital payment choices available to consumers toda...
research
05/29/2018

Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning

In this paper, we propose to combine imitation and reinforcement learnin...
research
06/14/2021

Targeted Data Acquisition for Evolving Negotiation Agents

Successful negotiators must learn how to balance optimizing for self-int...
research
09/26/2022

Understanding Hindsight Goal Relabeling Requires Rethinking Divergence Minimization

Hindsight goal relabeling has become a foundational technique for multi-...
research
12/08/2022

Model-based trajectory stitching for improved behavioural cloning and its applications

Behavioural cloning (BC) is a commonly used imitation learning method to...

Please sign up or login with your details

Forgot password? Click here to reset