Correcting discount-factor mismatch in on-policy policy gradient methods

06/23/2023
by Fengdi Che, et al.

The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, the gradient of the action likelihood, and a state distribution that involves discounting, called the discounted stationary distribution. However, commonly used on-policy methods based on the policy gradient theorem ignore the discount factor in the state distribution, which is technically incorrect and can even cause degenerate learning behavior in some environments. An existing solution corrects this discrepancy by using γ^t as a factor in the gradient estimate, but it is not widely adopted and performs poorly in tasks where later states are similar to earlier states. We introduce a novel distribution correction that accounts for the discounted stationary distribution and can be plugged into many existing gradient estimators. Our correction circumvents the performance degradation associated with the γ^t correction while incurring lower variance. Importantly, compared to the uncorrected estimators, our algorithm provides improved state emphasis that helps evade suboptimal policies in certain environments, and it consistently matches or exceeds the original performance on several OpenAI Gym and DeepMind suite benchmarks.
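
To make the mismatch concrete, here is a minimal sketch of the quantities involved. The notation is assumed rather than taken from the article: d^π_γ denotes the discounted stationary distribution, Q̂ an action-value estimate, and T a finite horizon.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}

% Policy gradient theorem: the state weighting is the discounted
% stationary (visitation) distribution d^{pi}_{gamma}.
\[
\nabla_\theta J(\theta)
  = \sum_{s} d^{\pi_\theta}_{\gamma}(s)
    \sum_{a} Q^{\pi_\theta}(s,a)\,
    \nabla_\theta \pi_\theta(a \mid s),
\qquad
d^{\pi_\theta}_{\gamma}(s)
  \propto \sum_{t=0}^{\infty} \gamma^{t}\,
    \Pr(S_t = s \mid \pi_\theta).
\]

% The gamma^t-corrected sample estimate weights each time step by gamma^t;
% common on-policy implementations drop this factor, which implicitly
% replaces d^{pi}_{gamma} with the undiscounted visitation distribution.
\[
\hat{g}_{\gamma^t}
  = \sum_{t=0}^{T-1} \gamma^{t}\,
    \hat{Q}(S_t, A_t)\,
    \nabla_\theta \log \pi_\theta(A_t \mid S_t).
\]

\end{document}
```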
