Deep Reinforcement Learning from Policy-Dependent Human Feedback

02/12/2019
by Dilip Arumugam, et al.

To widen their accessibility and increase their utility, intelligent agents must be able to learn complex behaviors as specified by (non-expert) human users. Moreover, they will need to learn these behaviors within a reasonable amount of time while efficiently leveraging the sparse feedback a human trainer is capable of providing. Recent work has shown that human feedback can be characterized as a critique of an agent's current behavior rather than as an alternative reward signal to be maximized, culminating in the COnvergent Actor-Critic by Humans (COACH) algorithm for making direct policy updates based on human feedback. Our work builds on COACH, moving to a setting where the agent's policy is represented by a deep neural network. We employ a series of modifications on top of the original COACH algorithm that are critical for successfully learning behaviors from high-dimensional observations while keeping sample complexity low. We demonstrate the effectiveness of our Deep COACH algorithm in the rich 3D world of Minecraft with an agent that learns to complete tasks by mapping from raw pixels to actions using only real-time human feedback in 10-15 minutes of interaction.
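
The abstract describes a COACH-style update in which scalar human feedback acts as the advantage in a direct policy-gradient step, rather than as a reward to be maximized. The sketch below illustrates that idea under stated assumptions: a small PyTorch policy over a discrete action space, an eligibility trace that accumulates log-probability gradients, and arriving feedback scaling the traced gradient. The names (Policy, coach_step, apply_feedback) and hyperparameters (alpha, lam) are illustrative, not the paper's; the full Deep COACH algorithm adds further modifications for high-dimensional observations and sample efficiency that are not shown here.

```python
# Minimal sketch of a COACH-style actor update from human feedback.
# Assumptions: discrete actions, a tiny MLP policy, and feedback in {-1, 0, +1}.
import torch
import torch.nn as nn


class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        # Return a categorical action distribution over the logits.
        return torch.distributions.Categorical(logits=self.net(obs))


def coach_step(policy, trace, obs, alpha=1e-3, lam=0.9):
    """Sample an action, fold grad(log pi) into the eligibility trace,
    and return the action plus a closure that applies the trainer's
    feedback as the advantage once it arrives."""
    dist = policy(obs)
    action = dist.sample()
    logp = dist.log_prob(action)

    # Decay the per-parameter trace and add the new log-prob gradient.
    grads = torch.autograd.grad(logp, list(policy.parameters()))
    for e, g in zip(trace, grads):
        e.mul_(lam).add_(g)

    def apply_feedback(f):
        # Gradient-ascent step: feedback f scales the traced policy gradient.
        with torch.no_grad():
            for p, e in zip(policy.parameters(), trace):
                p.add_(alpha * f * e)

    return action.item(), apply_feedback


# Usage: keep one trace per episode, reset it when the episode ends.
policy = Policy(obs_dim=16, n_actions=4)
trace = [torch.zeros_like(p) for p in policy.parameters()]
obs = torch.randn(16)  # stand-in for an encoded observation
action, give_feedback = coach_step(policy, trace, obs)
give_feedback(+1.0)  # trainer signals "good" for the recent behavior
```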

Related research

09/15/2021  Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With Eligibility Trace Under Reward, Policy, and Advantage Feedback
01/21/2017  Interactive Learning from Policy-Dependent Human Feedback
09/28/2017  Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces
02/13/2021  Mitigating Negative Side Effects via Environment Shaping
04/08/2021  Learning What To Do by Simulating the Past
02/05/2022  ASHA: Assistive Teleoperation via Human-in-the-Loop Reinforcement Learning
10/05/2020  Mastering Atari with Discrete World Models
