Learning to Explore with Meta-Policy Gradient

03/13/2018
by Tianbing Xu, et al.

The performance of off-policy learning methods, including deep Q-learning and deep deterministic policy gradient (DDPG), depends critically on the choice of exploration policy. Existing exploration methods mostly add noise to the current actor policy, so they can only explore local regions close to what the actor policy dictates. In this work, we develop a simple meta-policy gradient algorithm that adaptively learns the exploration policy in DDPG. The algorithm can train flexible exploration behaviors that are independent of the actor policy, yielding global exploration that significantly speeds up learning. In an extensive empirical study, we show that our method significantly improves the sample efficiency of DDPG on a variety of reinforcement learning tasks.
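The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The one-dimensional linear actor, the Gaussian exploration policy with its own parameter `phi`, and every function name below are hypothetical choices made for illustration. The sketch also assumes that the meta-reward is the measured improvement in the actor's return after it trains on the exploration data, which is one plausible reading of "meta-policy gradient" rather than something the abstract states.

```python
import random


def actor(state, theta):
    """Toy deterministic actor (hypothetical): action = theta * state."""
    return theta * state


def noisy_action(state, theta, sigma, rng):
    """Local exploration: the actor's action plus Gaussian noise."""
    return actor(state, theta) + rng.gauss(0.0, sigma)


def explore_action(state, phi, sigma, rng):
    """Independent exploration policy with its own parameter phi.

    Its mean does not depend on the actor's theta, so it can visit
    actions far from what the actor currently prefers.
    """
    return phi * state + rng.gauss(0.0, sigma)


def grad_log_prob(state, action, phi, sigma):
    """Derivative w.r.t. phi of log N(action; phi * state, sigma^2)."""
    return (action - phi * state) * state / sigma ** 2


def meta_update(phi, transitions, meta_reward, sigma, lr):
    """REINFORCE-style step on the exploration policy.

    transitions: list of (state, action) pairs collected by the
    exploration policy. meta_reward: assumed here to be the change in
    the actor's return after training on those transitions.
    """
    grad = sum(grad_log_prob(s, a, phi, sigma) for s, a in transitions)
    return phi + lr * meta_reward * grad


if __name__ == "__main__":
    rng = random.Random(0)
    phi = 0.0
    data = [(1.0, explore_action(1.0, phi, 1.0, rng)) for _ in range(5)]
    # A positive meta-reward makes the sampled actions more likely.
    print(meta_update(phi, data, meta_reward=1.0, sigma=1.0, lr=0.1))
```

In this sketch a positive meta-reward shifts `phi` toward the actions that were sampled and a negative one shifts it away, whereas `noisy_action` always stays centered on the actor's own output.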

research
10/30/2017

Sample-efficient Policy Optimization with Stein Control Variate

Policy gradient methods have achieved remarkable successes in solving ch...
research
11/15/2019

Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient

Model-free reinforcement learning algorithms such as Deep Deterministic ...
research
05/08/2019

Smoothing Policies and Safe Policy Gradients

Policy gradient algorithms are among the best candidates for the much an...
research
10/20/2020

Survivable Hyper-Redundant Robotic Arm with Bayesian Policy Morphing

In this paper we present a Bayesian reinforcement learning framework tha...
research
07/06/2018

Memory Augmented Policy Optimization for Program Synthesis with Generalization

This paper presents Memory Augmented Policy Optimization (MAPO): a novel...
research
07/14/2020

Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting

Policy gradient methods have shown success in learning control policies ...
research
03/05/2018

Learning Sample-Efficient Target Reaching for Mobile Robots

In this paper, we propose a novel architecture and a self-supervised pol...
