An Alternate Policy Gradient Estimator for Softmax Policies

12/22/2021
by Shivam Garg, et al.

Policy gradient (PG) estimators for softmax policies are ineffective with sub-optimally saturated initialization, which happens when the density concentrates on a sub-optimal action. Sub-optimal policy saturation may arise from bad policy initialization or sudden changes in the environment that occur after the policy has already converged, and softmax PG estimators require a large number of updates to recover an effective policy. This severe issue causes high sample inefficiency and poor adaptability to new situations. To mitigate this problem, we propose a novel policy gradient estimator for softmax policies that utilizes the bias in the critic estimate and the noise present in the reward signal to escape the saturated regions of the policy parameter space. Our analysis and experiments, conducted on bandits and classical MDP benchmarking tasks, show that our estimator is more robust to policy saturation.
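To make the saturation issue concrete, below is a minimal Python sketch of the standard softmax policy gradient (REINFORCE) estimator on a 3-armed bandit whose policy starts saturated on a sub-optimal arm. The bandit setup, step size, and variable names are illustrative assumptions, and this is the vanilla estimator whose failure mode the abstract describes, not the paper's proposed alternative. Because the gradient of log pi(a) under the softmax parameterization is e_a - pi, every component of the update is scaled by a near-zero probability when the policy is saturated, so the parameters barely move.

# Minimal illustration (not the paper's estimator): vanilla softmax policy
# gradient (REINFORCE) on a 3-armed bandit with a sub-optimally saturated start.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 0.0, 0.0])        # arm 0 is optimal
theta = np.array([-10.0, 10.0, -10.0])        # policy saturated on sub-optimal arm 1

def softmax(z):
    z = z - z.max()                           # numerical stability
    e = np.exp(z)
    return e / e.sum()

for step in range(5000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)                   # sample an action from the policy
    r = true_means[a] + rng.normal(0.0, 0.1)  # noisy reward
    grad_log_pi = -pi.copy()                  # d log pi(a) / d theta = e_a - pi
    grad_log_pi[a] += 1.0
    theta += 0.1 * r * grad_log_pi            # vanilla PG update

print(softmax(theta))  # the policy remains saturated on the sub-optimal arm

Running this, the probability of the optimal arm stays near zero even after thousands of updates, which is the slow-recovery behavior the proposed estimator is designed to mitigate.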

Related research:

02/21/2018  Clipped Action Policy Gradient
Many continuous control tasks have bounded action spaces and clip out-of...

05/13/2020  On the Global Convergence Rates of Softmax Policy Gradient Methods
We make three contributions toward better understanding policy gradient ...

03/09/2021  Model-free Policy Learning with Reward Gradients
Policy gradient methods estimate the gradient of a policy objective sole...

06/15/2022  Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games
We study the performance of policy gradient methods for the subclass of ...

02/22/2021  Softmax Policy Gradient Methods Can Take Exponential Time to Converge
The softmax policy gradient (PG) method, which performs gradient ascent ...

02/04/2022  A Temporal-Difference Approach to Policy Gradient Estimation
The policy gradient theorem (Sutton et al., 2000) prescribes the usage o...

08/12/2021  A functional mirror ascent view of policy gradient methods with function approximation
We use functional mirror ascent to propose a general framework (referred...