On All-Action Policy Gradients

10/24/2022
by   Michal Nauman, et al.

In this paper, we analyze the variance of the stochastic policy gradient (SPG) estimated with many action samples per state (all-action SPG). We decompose the variance of SPG and derive an optimality condition for all-action SPG. The optimality condition shows when all-action SPG should be preferred over its single-action counterpart and allows one to determine a variance-minimizing sampling scheme for SPG estimation. Furthermore, we propose the dynamics-all-action (DAA) module, an augmentation that allows for all-action sampling without manipulating the environment. DAA addresses the problems associated with using a Q-network for all-action sampling and can be readily applied to any on-policy SPG algorithm. We find that combining DAA with a canonical on-policy algorithm (PPO) yields better sample efficiency and higher policy returns on a variety of challenging continuous-action environments.
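The variance reduction behind all-action SPG can be illustrated with a minimal score-function gradient sketch. The toy policy, critic, and sample counts below are illustrative assumptions, not the paper's setup: a 1-D Gaussian policy with a known quadratic critic standing in for a Q-network, comparing a single-action estimate against an average over many sampled actions for the same state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not from the paper):
# policy pi(a|s) = N(mu, sigma^2) with learnable mean mu, and a
# known critic Q(s, a) = -(a - 1)^2 standing in for a Q-network.
mu, sigma = 0.0, 1.0

def q_value(a):
    # Hypothetical critic: peaks at a = 1, so the true gradient
    # w.r.t. mu is positive (it pushes the policy mean upward).
    return -(a - 1.0) ** 2

def grad_log_pi(a):
    # Score function: d/d_mu log N(a; mu, sigma^2) = (a - mu) / sigma^2
    return (a - mu) / sigma ** 2

def spg_estimate(n_actions):
    # Score-function policy-gradient estimate for a single state.
    # n_actions = 1 is the single-action SPG; n_actions >> 1
    # approximates the all-action estimator by averaging over
    # many action samples drawn from the current policy.
    actions = rng.normal(mu, sigma, size=n_actions)
    return np.mean(q_value(actions) * grad_log_pi(actions))

# Compare the spread of repeated estimates: both are unbiased
# (mean near 2 for this toy problem), but averaging over 64
# actions per state shrinks the variance by roughly that factor.
single = np.array([spg_estimate(1) for _ in range(2000)])
all_action = np.array([spg_estimate(64) for _ in range(2000)])
print(np.var(single) > np.var(all_action))
```

Both estimators target the same gradient; the all-action average only changes the variance, which is the trade-off (estimator variance versus the cost of evaluating many actions) that the paper's optimality condition formalizes.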

Related research:

10/02/2019 · Analyzing the Variance of Policy Gradient Estimators for the Linear-Quadratic Regulator — We study the variance of the REINFORCE policy gradient estimator in envi...

03/06/2017 · Revisiting stochastic off-policy action-value gradients — Off-policy stochastic actor-critic methods rely on approximating the sto...

06/13/2018 · Marginal Policy Gradients for Complex Control — Many complex domains, such as robotics control and real-time strategy (R...

03/20/2018 · Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines — Policy gradient methods have enjoyed great success in deep reinforcement...

10/21/2019 · All-Action Policy Gradient Methods: A Numerical Integration Approach — While often stated as an instance of the likelihood ratio trick [Rubinst...

01/26/2023 · Joint action loss for proximal policy optimization — PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient...

11/06/2021 · Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods — In reinforcement learning, continuous time is often discretized by a tim...
