A Temporal-Difference Approach to Policy Gradient Estimation

02/04/2022
by   Samuele Tosatto, et al.

The policy gradient theorem (Sutton et al., 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this requirement, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach to reconstructing the policy gradient from the start state without requiring a particular sampling strategy. In this form, the policy gradient can be expressed in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation for gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and better performance in the presence of off-policy samples.
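The abstract describes estimating a gradient critic with temporal-difference updates from off-policy data and combining it with the policy at the start state. Below is a minimal tabular sketch of that idea, assuming a softmax policy and a recursion of the form ∇Q(s,a) ≈ γ E[∇log π(a'|s') Q(s',a') + ∇Q(s',a')]. The environment dimensions, variable names, and the exact form of the recursion are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical tabular setup (illustrative sizes, not from the paper):
# nS states, nA actions, softmax policy with parameters theta (nS x nA).
nS, nA, gamma, alpha = 5, 3, 0.95, 0.05
rng = np.random.default_rng(0)
theta = rng.normal(size=(nS, nA))

def policy(s):
    # Softmax policy over actions in state s.
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def grad_log_pi(s, a):
    # Gradient of log pi(a|s) w.r.t. theta for the softmax policy.
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# Ordinary action-value critic, plus a "gradient critic" holding one
# parameter-sized estimate of grad_theta Q per (state, action) pair.
Q = np.zeros((nS, nA))
W = np.zeros((nS, nA) + theta.shape)

def td_update(s, a, r, s_next, done):
    """One temporal-difference step for both critics on an off-policy
    transition (s, a, r, s_next). The expectation at s_next is taken under
    the target policy, so no importance weights over the behavior policy
    are needed (a sketch of the idea, not the paper's exact update)."""
    pi_next = policy(s_next)
    if done:
        q_target = r
        w_target = np.zeros_like(theta)
    else:
        q_target = r + gamma * pi_next @ Q[s_next]
        # Bellman-style recursion for the gradient critic (assumed form):
        # grad Q(s,a) ~ gamma * E_a'[ grad log pi(a'|s') Q(s',a') + grad Q(s',a') ]
        w_target = gamma * sum(
            pi_next[a2] * (grad_log_pi(s_next, a2) * Q[s_next, a2] + W[s_next, a2])
            for a2 in range(nA)
        )
    Q[s, a] += alpha * (q_target - Q[s, a])
    W[s, a] += alpha * (w_target - W[s, a])

def policy_gradient(s0):
    # Reconstruct the policy gradient from the start state s0 using the
    # learned critics: E_a[ grad log pi(a|s0) Q(s0,a) + grad Q(s0,a) ].
    pi0 = policy(s0)
    return sum(
        pi0[a] * (grad_log_pi(s0, a) * Q[s0, a] + W[s0, a]) for a in range(nA)
    )
```

In this sketch, td_update would be called on each transition of an off-policy stream, and policy_gradient(s0) read off whenever a gradient-ascent step on theta is taken.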


