
Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance

by Mingxuan Jing, et al. (Tsinghua University)

In this paper, we study Reinforcement Learning from Demonstrations (RLfD), which improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations. Most existing RLfD methods require the demonstrations to be perfect and sufficient, which is rarely the case in practice. To handle imperfect demonstrations, we first formally define an imperfect-expert setting for RLfD, and then show that previous methods suffer from two issues, concerning optimality and convergence respectively. Building on our theoretical findings, we address both issues by treating the expert guidance as a soft constraint that regulates the agent's policy exploration, which leads to a constrained optimization problem. We further show that this problem can be solved efficiently by performing a local linear search on its dual form. Extensive empirical evaluations on a comprehensive collection of benchmarks indicate that our method attains consistent improvement over other RLfD counterparts.
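The core idea — maximize return subject to a soft divergence constraint toward the expert, then solve the Lagrangian dual with a local linear search — can be illustrated on a toy problem. This is a minimal sketch, not the paper's algorithm: the quadratic `reward` and `divergence` functions, the budget `EPS`, and the closed-form inner maximizer are all illustrative assumptions standing in for the RL objective, the policy-expert divergence, and the policy-optimization step.

```python
import numpy as np

def reward(theta):
    # Toy stand-in for the RL objective J(theta); true optimum at theta = 3.
    return -(theta - 3.0) ** 2

def divergence(theta):
    # Toy stand-in for the divergence between the agent policy and the expert.
    return (theta - 1.0) ** 2

EPS = 1.0  # allowed divergence budget (the "soft" expert constraint)

def primal_argmax(lam):
    # For this quadratic toy, the maximizer of the Lagrangian
    # L(theta, lam) = J(theta) - lam * (D(theta) - EPS) is closed-form:
    # set dL/dtheta = -2(theta - 3) - 2*lam*(theta - 1) = 0.
    return (3.0 + lam) / (1.0 + lam)

def dual(lam):
    # Dual function g(lam) = max_theta L(theta, lam).
    t = primal_argmax(lam)
    return reward(t) - lam * (divergence(t) - EPS)

# "Local linear search" on the dual: grid-search the multiplier lam >= 0
# and keep the value that minimizes g(lam).
lams = np.linspace(0.0, 5.0, 501)
lam_star = lams[np.argmin([dual(l) for l in lams])]
theta_star = primal_argmax(lam_star)
```

Here the unconstrained optimum (`theta = 3`) violates the divergence budget, so the search settles on the boundary solution `theta = 2`, where the constraint holds with equality — the qualitative behavior a soft expert constraint is meant to produce.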



Pretrain Soft Q-Learning with Imperfect Demonstrations

Pretraining reinforcement learning methods with demonstrations has been ...

Bayesian Q-learning With Imperfect Expert Demonstrations

Guided exploration with expert demonstrations improves data efficiency f...

Policy Learning Using Weak Supervision

Most existing policy learning solutions require the learning agents to r...

Learning from Demonstration without Demonstrations

State-of-the-art reinforcement learning (RL) algorithms suffer from high...

Guarded Policy Optimization with Imperfect Online Demonstrations

The Teacher-Student Framework (TSF) is a reinforcement learning setting ...

Hierarchical Deep Q-Network with Forgetting from Imperfect Demonstrations in Minecraft

We present hierarchical Deep Q-Network with Forgetting (HDQF) that took ...

Marginal MAP Estimation for Inverse RL under Occlusion with Observer Noise

We consider the problem of learning the behavioral preferences of an exp...