Improving Policy Gradient by Exploring Under-appreciated Rewards

11/28/2016
by   Ofir Nachum, et al.
0

This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring small modifications to an implementation of the REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Our algorithm successfully solves a benchmark multi-digit addition task and generalizes to long sequences. This is, to our knowledge, the first time that a pure RL method has solved addition using only reward feedback.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/15/2019

Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient

Model-free reinforcement learning algorithms such as Deep Deterministic ...
research
02/21/2020

Accelerating Reinforcement Learning with a Directional-Gaussian-Smoothing Evolution Strategy

Evolution strategy (ES) has been shown great promise in many challenging...
research
10/08/2020

Learning Intrinsic Symbolic Rewards in Reinforcement Learning

Learning effective policies for sparse objectives is a key challenge in ...
research
03/09/2021

Model-free Policy Learning with Reward Gradients

Policy gradient methods estimate the gradient of a policy objective sole...
research
07/20/2023

Reparameterized Policy Learning for Multimodal Trajectory Optimization

We investigate the challenge of parametrizing policies for reinforcement...
research
06/02/2023

Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space

We consider the reinforcement learning (RL) problem with general utiliti...
research
08/31/2018

Directed Exploration in PAC Model-Free Reinforcement Learning

We study an exploration method for model-free RL that generalizes the co...

Please sign up or login with your details

Forgot password? Click here to reset