Samples are not all useful: Denoising policy gradient updates using variance

04/08/2019
by Yannis Flet-Berliac, et al.

Policy gradient algorithms in reinforcement learning rely on efficiently sampling the environment. Most sampling procedures are based solely on sampling the agent's policy. However, other quantities already computed by these algorithms could be used to improve the sampling prior to each policy update. Following this line of thought, we propose a method where a transition is used in the gradient update if it meets a particular criterion, and rejected otherwise. This criterion is the fraction of variance explained (V^ex), a measure of the discrepancy between a model and actual samples. V^ex can be used to evaluate the impact each transition will have on learning. This criterion refines sampling and improves the policy gradient algorithm. In this paper: (1) We introduce and explore V^ex, the selection criterion used to improve the sampling procedure. (2) We conduct experiments across a variety of standard benchmark environments, including continuous control problems. Our results show improved performance over policy gradient updates that do not use the V^ex criterion. (3) We investigate why V^ex is a good criterion for selecting samples that positively impact learning. (4) We show how this criterion can be interpreted as a dynamic way to adjust the trade-off between exploration and exploitation.
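To make the selection criterion concrete, below is a minimal sketch of how a fraction-of-variance-explained score could be computed and used as a keep-or-reject test before a policy gradient update. It assumes the standard explained-variance definition, V^ex = 1 - Var(R - V_hat) / Var(R), between empirical returns R and the critic's value predictions V_hat; the function names, threshold, and per-batch granularity are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def fraction_of_variance_explained(returns, values):
    """Standard explained-variance measure between empirical returns and
    the critic's value predictions: V^ex = 1 - Var(R - V_hat) / Var(R).
    Values close to 1 mean the value function explains the returns well;
    values at or below 0 mean it explains no more than a constant predictor."""
    returns = np.asarray(returns, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    var_returns = np.var(returns)
    if var_returns == 0.0:
        # Degenerate batch with constant returns: no variance to explain.
        return 0.0
    return 1.0 - np.var(returns - values) / var_returns

def keep_for_update(batch_returns, batch_values, threshold=0.0):
    """Hypothetical filtering step: keep a batch of transitions for the
    policy gradient update only if its V^ex meets the threshold,
    reject it otherwise. The threshold value is illustrative."""
    vex = fraction_of_variance_explained(batch_returns, batch_values)
    return vex >= threshold, vex
```

In practice such a test would be evaluated on each rollout or mini-batch collected under the current policy, and only the batches that pass it would contribute to the gradient estimate; how the threshold is set over training would govern the exploration-exploitation balance the abstract refers to.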

