On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces

01/28/2022
by Amrit Singh Bedi, et al.

We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes that the score function associated with a policy is bounded, an assumption that fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which the score function is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is to establish that this scheme converges with constant step and batch sizes, whereas prior works require the step size to shrink to zero or the batch size to grow to infinity. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings as compared with standard benchmarks.
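The contrast between an unbounded Gaussian score and a bounded heavy-tailed score can be checked numerically. The sketch below is illustrative only: it assumes a scalar action, differentiates the log-density with respect to the location parameter alone, and uses the Cauchy distribution as one example of a heavy-tailed family (the abstract does not say which family the authors use). The function names and the grid-based maximum are hypothetical helpers, not part of the paper.

```python
def gaussian_score_mean(a, mu, sigma):
    """d/d(mu) of log N(a | mu, sigma^2), which equals (a - mu) / sigma^2."""
    return (a - mu) / sigma ** 2


def cauchy_score_loc(a, mu, sigma):
    """d/d(mu) of log Cauchy(a | mu, sigma), which equals 2d / (sigma^2 + d^2) with d = a - mu."""
    d = a - mu
    return 2.0 * d / (sigma ** 2 + d ** 2)


def max_abs_score(score_fn, radius, mu=0.0, sigma=1.0, n=2001):
    """Largest |score| over a uniform grid of actions in [mu - radius, mu + radius]."""
    grid = (mu - radius + 2.0 * radius * i / (n - 1) for i in range(n))
    return max(abs(score_fn(a, mu, sigma)) for a in grid)


if __name__ == "__main__":
    # Assumed action-space radii, chosen only to show the trend.
    for radius in (1.0, 10.0, 100.0):
        g = max_abs_score(gaussian_score_mean, radius)
        c = max_abs_score(cauchy_score_loc, radius)
        print(f"radius={radius:6.1f}  gaussian={g:8.2f}  cauchy={c:6.3f}")
```

Under these assumptions the Gaussian score grows linearly with the radius of the action set, so any bound on it must depend on that radius, while the Cauchy score never exceeds 1/sigma in magnitude regardless of how far the action lies from the location parameter.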

research
06/15/2021

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

Reinforcement learning is a framework for interactive decision-making wi...
research
02/20/2021

On Proximal Policy Optimization's Heavy-tailed Gradients

Modern policy gradient algorithms, notably Proximal Policy Optimization ...
research
06/12/2022

Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies

In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradien...
research
11/03/2021

Proximal Policy Optimization with Continuous Bounded Action Space via the Beta Distribution

Reinforcement learning methods for continuous control tasks have evolved...
research
02/11/2021

Robust Policy Gradient against Strong Data Corruption

We study the problem of robust reinforcement learning under adversarial ...
research
11/21/2017

Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

Policy optimization methods have shown great promise in solving complex ...
research
04/28/2019

Learning walk and trot from the same objective using different types of exploration

In quadruped gait learning, policy search methods that scale high dimens...
