Directed Policy Gradient for Safe Reinforcement Learning with Human Advice

08/13/2018
by Hélène Plisnier, et al.

Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be they co-workers, users or clients. It is desirable that these agents adjust to people's preferences, learn faster thanks to their help, and act safely around them. We argue that most current approaches that learn from human feedback are unsafe: rewarding or punishing the agent a posteriori cannot immediately prevent it from wrongdoing. In this paper, we extend Policy Gradient to make it robust to external directives that would otherwise break the fundamentally on-policy nature of Policy Gradient. Our technique, Directed Policy Gradient (DPG), allows a teacher or backup policy to override the agent before it acts undesirably, while allowing the agent to leverage human advice or directives to learn faster. Our experiments demonstrate that DPG makes the agent learn much faster than reward-based approaches, while requiring an order of magnitude less advice.
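To make the idea concrete, the sketch below shows one plausible way an advice distribution can steer a softmax policy before an action is executed while a standard policy-gradient update still applies. This is not the paper's exact DPG algorithm: the 3-armed bandit environment, the multiplicative merge of policy and advice, and all hyper-parameters are illustrative assumptions.

# Minimal sketch: policy gradient with an advice distribution that can
# override the agent before it acts. Illustrative only; not the authors'
# implementation. Environment, merge rule and hyper-parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 3
theta = np.zeros(n_actions)                # softmax policy parameters
true_reward = np.array([1.0, 0.2, -5.0])   # hypothetical: action 2 is "unsafe"

# Advice: a fixed distribution that (nearly) forbids the unsafe action.
advice = np.array([0.5, 0.5, 1e-6])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

alpha = 0.1
for step in range(2000):
    pi = softmax(theta)

    # Merge the agent's policy with the advice (element-wise product,
    # renormalized) so the advice can veto an action before it is executed.
    merged = pi * advice
    merged /= merged.sum()

    a = rng.choice(n_actions, p=merged)
    r = true_reward[a] + rng.normal(scale=0.1)

    # REINFORCE update through the merged distribution: for a softmax policy
    # combined with a fixed advice distribution, the gradient of
    # log merged[a] w.r.t. theta works out to one_hot(a) - merged.
    grad_log = -merged.copy()
    grad_log[a] += 1.0
    theta += alpha * r * grad_log

print("learned policy:", softmax(theta))

Because the log-gradient of the merged distribution stays well defined even when the advice vetoes the agent's preferred action, overridden steps still produce a usable learning signal, which mirrors the robustness to external directives described in the abstract.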

