1 Introduction
Model-free reinforcement learning has been applied successfully to a variety of problems Duan et al. (2016). Nevertheless, many state-of-the-art approaches introduce bias into the policy updates and hence do not guarantee convergence. This can make the algorithms oversensitive to changes in initial conditions and has an adverse influence on reproducibility Henderson et al. (2017).
Policy updates often use values learned by a critic to reduce the variance of stochastic gradients. However, an arbitrary parametrization of the critic can introduce bias into the policy updates. Early work in Sutton et al. (1999) provides restrictions on the form of the function approximator under which the policy gradient remains unbiased and the convergence of learning is guaranteed. The notion of compatible features introduced in Sutton et al. (1999) also has connections to the natural policy gradient Kakade (2001). Compatible features remain an active area of research Pajarinen et al. (2019).

Modern policy optimization algorithms focus on optimizing a surrogate objective that approximates the value of the policy Schulman et al. (2015, 2017). This surrogate objective remains an accurate optimization target when the policy used to gather the data is close to the policy being optimized Schulman et al. (2015); Kakade and Langford (2002). In this work we investigate the conditions on the parametrization of the critic under which the updates of the surrogate objective remain unbiased.
2 Preliminaries
We assume a classical MDP formulation as in Sutton et al. (1999). We consider an MDP given by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P(s_{t+1} \mid s_t, a_t)$ is a transition model, $r(s, a)$ is a reward function, $\rho_0$ is an initial distribution over states and $\gamma \in (0, 1)$ is a discount factor. We denote a trajectory as $\tau = (s_0, a_0, s_1, a_1, \dots)$. Given a stochastic policy $\pi(a \mid s)$, we use $\tau \sim \pi$ to denote that the trajectory has been generated by following policy $\pi$, i.e. $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
We denote the unnormalized discounted state occupancy measure as $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$. The value function is defined as $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$ and the state-action value function is $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The advantage function is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. To emphasize that the policy is parametrized by parameters $\theta$ we use the notation $\pi_\theta$. We consider policies such that $\nabla_\theta \pi_\theta(a \mid s)$ exists and is continuous for every $s$ and $a$. This is a general class of policies that includes, e.g., expressive deep neural networks. Under this discounted cost setting, reinforcement learning algorithms seek a policy that maximizes its value, defined as:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] \qquad (1)$$
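As a minimal illustration, the value in Equation (1) can be estimated by averaging discounted returns over sampled trajectories. The `sample_rewards` callable below is a hypothetical interface standing in for a rollout of $\pi_\theta$ in the environment:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted return sum_t gamma^t * r_t of one trajectory."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))

def estimate_J(sample_rewards, gamma, n_rollouts=1000):
    """Monte Carlo estimate of J(theta): average discounted return over
    rollouts. sample_rewards() returns the reward sequence of one
    trajectory generated by following pi_theta (hypothetical interface)."""
    return float(np.mean([discounted_return(sample_rewards(), gamma)
                          for _ in range(n_rollouts)]))
```

For example, a deterministic reward sequence $[1, 1, 1]$ with $\gamma = 0.5$ has discounted return $1 + 0.5 + 0.25 = 1.75$.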
When dealing with two different parametric policies, $\pi_\theta$ and $\pi_{\theta'}$, with parameters $\theta$ and $\theta'$, respectively, we follow the notation in Schulman et al. (2015) and define a surrogate policy optimization objective as
$$L_\theta(\theta') = J(\theta) + \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[A^{\pi_\theta}(s, a)\right] \qquad (2)$$
When the expected divergence between $\pi_\theta$ and $\pi_{\theta'}$ is small, $L_\theta(\theta')$ can be treated as a good approximation of $J(\theta')$ Schulman et al. (2015); Kakade and Langford (2002).
3 Related work
We begin by restating a well-known result presented in Sutton et al. (1999) that allows calculating the gradient of $J(\theta)$ w.r.t. policy parameters $\theta$:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)\right] \qquad (3)$$
The remarkable property of the expression for the policy gradient given by Equation (3) is that calculating the gradient does not require calculating the gradient of the occupancy measure $\rho_{\pi_\theta}$. Thus, if we know $Q^{\pi_\theta}$, the gradient can be approximated with Monte Carlo sampling using the trajectories obtained by following policy $\pi_\theta$.
However, since $Q^{\pi_\theta}$ is not known in advance, it has to be estimated. One possible choice of estimator is to use the empirical returns: $\hat{Q}(s_t, a_t) = \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l})$. The problem with this choice of estimator is the large variance of the stochastic version of the gradients, which results in poor practical performance Ciosek and Whiteson (2018).
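The empirical-return estimator can be computed for a whole trajectory in one backward pass; a small sketch:

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """Empirical estimates hat{Q}(s_t, a_t) = sum_{l>=0} gamma^l r_{t+l},
    computed with the backward recursion G_t = r_t + gamma * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

For rewards $[1, 2, 3]$ and $\gamma = 0.5$ this yields $[2.75, 3.5, 3.0]$.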
To reduce the variance in the estimation of $Q^{\pi_\theta}$, many algorithms learn a parametric approximation $Q_w$ by solving the regression problem:
$$\min_w \sum_i \left(Q_w(s_i, a_i) - \hat{Q}(s_i, a_i)\right)^2 \qquad (4)$$
where $(s_i, a_i)$ are state-action pairs sampled with policy $\pi_\theta$. Using such a $Q_w$ in place of $Q^{\pi_\theta}$ can introduce bias into the policy updates. The following theorem, derived in Sutton et al. (1999); Konda and Tsitsiklis (2003), provides conditions under which the function approximator of $Q^{\pi_\theta}$ does not introduce bias into $\nabla_\theta J(\theta)$.
Theorem 1.
Let $Q_w$ be a differentiable function approximator that satisfies the following conditions:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\left(Q_w(s, a) - Q^{\pi_\theta}(s, a)\right) \nabla_w Q_w(s, a)\right] = 0 \qquad (5)$$
and
$$\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s). \qquad (6)$$
Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q_w(s, a)\right]. \qquad (7)$$
Note that condition (5) can be satisfied by solving the regression problem $\min_w \mathbb{E}_{s \sim \rho_{\pi_\theta}, a \sim \pi_\theta}\left[\left(Q_w(s, a) - Q^{\pi_\theta}(s, a)\right)^2\right]$. The condition given by (6) can be satisfied by setting $Q_w(s, a) = w^\top \nabla_\theta \log \pi_\theta(a \mid s) + c(s)$. The constant $c(s)$ can be set to $V^{\pi_\theta}(s)$ to ensure that $w^\top \nabla_\theta \log \pi_\theta(a \mid s)$ approximates the advantage $A^{\pi_\theta}(s, a)$, since the expectation $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$ for every $s$. See the discussion in Sutton et al. (1999) for details.
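A small sketch of this compatible form for a tabular softmax policy. The Q targets here are arbitrary placeholders, used only to show the least-squares fit; the zero-mean check at the end illustrates why the compatible part models an advantage:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
theta = rng.normal(size=(n_states, n_actions))  # softmax policy logits

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """Compatible features: nabla_theta log pi_theta(a|s), flattened.
    Only the block belonging to state s is nonzero."""
    g = np.zeros((n_states, n_actions))
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g.ravel()

# Fit w by least squares on placeholder Q targets: Q_w = w^T grad_log_pi.
samples = [(s, a) for s in range(n_states) for a in range(n_actions)]
Phi = np.stack([grad_log_pi(s, a) for s, a in samples])
q_targets = rng.normal(size=len(samples))  # placeholder targets
w, *_ = np.linalg.lstsq(Phi, q_targets, rcond=None)

# The compatible part has zero mean under the policy in every state.
zero_means = [sum(pi(s)[a] * grad_log_pi(s, a) for a in range(n_actions))
              for s in range(n_states)]
```

Each entry of `zero_means` is the zero vector, confirming $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$.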
Performing gradient ascent on $J(\theta)$ requires resampling the data with every policy update, as the expectation in Equation (3) is taken over the distribution of trajectories induced by $\pi_\theta$. The approach presented in Kakade and Langford (2002); Schulman et al. (2015) tackles the problem of improving the policy in a different way.
Given the data gathered using the current policy $\pi_\theta$, we want to estimate a lower bound on the performance of an arbitrary policy $\pi_{\theta'}$ and perform maximisation w.r.t. $\theta'$. Since the lower bound is tight at $\theta' = \theta$, maximizing this lower bound w.r.t. $\theta'$ guarantees an improvement in the value of the policy Schulman et al. (2015). Hence, sequential optimization of the discussed lower bound is called Monotonic Policy Improvement. The lower bound derived in Schulman et al. (2015) is summarized in the following theorem.
Theorem 2.
Let $\epsilon = \max_{s, a} \left|A^{\pi_\theta}(s, a)\right|$ and $D_{\mathrm{KL}}^{\max}(\theta, \theta') = \max_s D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta'}(\cdot \mid s)\right)$. Then the value $J(\theta')$ can be lower bounded as follows:
$$J(\theta') \geq L_\theta(\theta') - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} D_{\mathrm{KL}}^{\max}(\theta, \theta'). \qquad (8)$$
The work in Achiam et al. (2017) extends this result so that the $\max$ operator in $D_{\mathrm{KL}}^{\max}$ can be replaced with the expected value taken w.r.t. $\rho_{\pi_\theta}$. Calculating the gradient $\nabla_{\theta'} L_\theta(\theta')$ is straightforward, as the occupancy measure in the definition of $L_\theta(\theta')$ does not depend on $\theta'$.
In the practical setting, $L_\theta(\theta')$ is optimized by constraining the divergence between $\pi_\theta$ and $\pi_{\theta'}$ Schulman et al. (2015, 2017). The algorithm derived in Schulman et al. (2015) is closely related to natural gradient policy optimisation Kakade (2001). Note that $L_\theta(\theta')$ is a biased approximation of $J(\theta')$, but the bias can be controlled by restricting the distance between $\pi_\theta$ and $\pi_{\theta'}$.
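To make the surrogate gradient concrete, a sketch for a single-state (bandit) softmax policy: the gradient is estimated from actions sampled by the data-gathering policy via importance ratios, and agrees with the exact expectation under the new policy. The advantage values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 3
theta_old = np.zeros(n_actions)          # data-gathering policy pi_theta
theta_new = np.array([0.5, 0.0, -0.5])   # policy pi_theta' being optimized
adv = np.array([1.0, -0.5, -0.5])        # hypothetical advantages A^{pi_theta}

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(x, a):
    """Gradient of the log softmax policy w.r.t. its logits at action a."""
    g = -softmax(x)
    g[a] += 1.0
    return g

p_old, p_new = softmax(theta_old), softmax(theta_new)

# Monte Carlo surrogate gradient: sample actions from the OLD policy and
# reweight each score by the importance ratio pi_new / pi_old.
actions = rng.choice(n_actions, size=50_000, p=p_old)
mc_grad = np.mean([(p_new[a] / p_old[a]) * score(theta_new, a) * adv[a]
                   for a in actions], axis=0)

# Exact gradient: E_{a ~ pi_new}[grad log pi_new(a) * A(a)].
exact = sum(p_new[a] * score(theta_new, a) * adv[a] for a in range(n_actions))
```

With enough samples `mc_grad` converges to `exact`; the importance ratio is what lets old-policy data estimate the new policy's gradient.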
4 Compatible features for surrogate policy optimization
In this section, we seek a parametric form of an approximator $A_w$ of $A^{\pi_\theta}$ for which $\nabla_{\theta'} L_\theta(\theta')$ remains unbiased. To this end, we follow the approach presented in Sutton et al. (1999). We derive the following theorem.
Theorem 3 (Compatible features for Monotonic Policy Improvement).
Assume the following conditions are satisfied:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\left(A_w(s, a) - A^{\pi_\theta}(s, a)\right) \nabla_w A_w(s, a)\right] = 0 \qquad (9)$$
and
$$\nabla_w A_w(s, a) = \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s). \qquad (10)$$
Then,
$$\nabla_{\theta'} L_\theta(\theta') = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, A_w(s, a)\right]. \qquad (11)$$
Proof.
Firstly, we note that the value function $V^{\pi_\theta}$ in the definition of $L_\theta(\theta')$ fulfils the role of a control variate, i.e. it does not influence the expectation of the gradient. To see this, we analyse the gradient $\nabla_{\theta'} \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[V^{\pi_\theta}(s)\right]$:
$$\nabla_{\theta'} \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[V^{\pi_\theta}(s)\right] = \mathbb{E}_{s \sim \rho_{\pi_\theta}}\left[V^{\pi_\theta}(s) \, \nabla_{\theta'} \int \pi_{\theta'}(a \mid s) \, da\right] = 0, \qquad (12)$$
where the second step is allowed since $\nabla_{\theta'} \pi_{\theta'}(a \mid s)$ exists and is continuous for every $s$ and $a$, and the last step is due to the fact that $\int \pi_{\theta'}(a \mid s) \, da = 1$ does not depend on $\theta'$, i.e. $\nabla_{\theta'} \int \pi_{\theta'}(a \mid s) \, da = \nabla_{\theta'} 1 = 0$.
Next we define $\epsilon_w(s, a) = A_w(s, a) - A^{\pi_\theta}(s, a)$. We seek a condition under which the following equality holds:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, \epsilon_w(s, a)\right] = 0. \qquad (13)$$
By subtracting the assumption given by (9), which equals zero, from (13), we obtain:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\left(\frac{\nabla_{\theta'} \pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} - \nabla_w A_w(s, a)\right) \epsilon_w(s, a)\right] = 0, \qquad (14)$$
where we have used the log trick, $\nabla_{\theta'} \pi_{\theta'}(a \mid s) = \pi_{\theta'}(a \mid s) \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$, in the second line. Hence Equation (13) can be satisfied by requiring:
$$\nabla_w A_w(s, a) = \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s). \qquad (15)$$
Integrating this last equation w.r.t. $w$ yields $A_w(s, a) = w^\top \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$, which completes the proof. ∎
Again, the resulting compatible features have zero expectation under $\pi_\theta$, mirroring the fact that $\mathbb{E}_{a \sim \pi_\theta}\left[A^{\pi_\theta}(s, a)\right] = 0$, which ensures that:
$$\mathbb{E}_{a \sim \pi_\theta}\left[w^\top \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)\right] = \mathbb{E}_{a \sim \pi_{\theta'}}\left[w^\top \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)\right] = 0. \qquad (16)$$
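A numeric sanity check of this construction on a single-state softmax policy (the advantage values are hypothetical): fitting $w$ by the regression in (9) with the features in (15) makes the surrogate gradient computed from the critic coincide with the one computed from the true advantages:

```python
import numpy as np

n_actions = 3
theta_old = np.array([0.2, -0.1, 0.0])   # pi_theta (data-gathering)
theta_new = np.array([0.6, 0.0, -0.4])   # pi_theta'
adv = np.array([0.8, -0.3, -0.6])        # hypothetical advantages

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(x, a):
    g = -softmax(x)
    g[a] += 1.0
    return g

p_old, p_new = softmax(theta_old), softmax(theta_new)

# Compatible features of Theorem 3: importance ratio times the new score.
phi = np.stack([(p_new[a] / p_old[a]) * score(theta_new, a)
                for a in range(n_actions)])

# Solve the regression E_{a~pi_old}[(w^T phi(a) - A(a))^2] exactly via
# weighted normal equations over the finite action set.
W = np.diag(p_old)
w = np.linalg.pinv(phi.T @ W @ phi) @ phi.T @ W @ adv

# Exact surrogate gradients with the true advantages vs. with the critic.
g_true = sum(p_old[a] * phi[a] * adv[a] for a in range(n_actions))
g_critic = sum(p_old[a] * phi[a] * (phi[a] @ w) for a in range(n_actions))
```

By the first-order optimality condition of the regression, `g_critic` equals `g_true` up to numerical precision.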
In Sutton et al. (1999) the authors conjecture that the compatible features might be the only choice of features that leads to an unbiased policy gradient $\nabla_\theta J(\theta)$. In the case of $\nabla_{\theta'} L_\theta(\theta')$ we can provide another choice of features leading to an unbiased gradient. The features derived in Theorem 3 depend on the importance sampling weights $\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}$. To remove the importance sampling weights from the derived compatible features, we modify the condition in Equation (9) by replacing the action sampling distribution $\pi_\theta$ with $\pi_{\theta'}$.
Theorem 4.
Assume that the following conditions are satisfied:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\left(A_w(s, a) - A^{\pi_\theta}(s, a)\right) \nabla_w A_w(s, a)\right] = 0 \qquad (17)$$
and
$$\nabla_w A_w(s, a) = \nabla_{\theta'} \log \pi_{\theta'}(a \mid s). \qquad (18)$$
Then,
$$\nabla_{\theta'} L_\theta(\theta') = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, A_w(s, a)\right]. \qquad (19)$$
Proof.
We derive the result by following a similar line of thought as in the proof of Theorem 3. Again we use $\epsilon_w(s, a) = A_w(s, a) - A^{\pi_\theta}(s, a)$. We subtract the assumption given by (17), which equals zero, from the required condition $\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, \epsilon_w(s, a)\right] = 0$, which yields:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\left(\nabla_{\theta'} \log \pi_{\theta'}(a \mid s) - \nabla_w A_w(s, a)\right) \epsilon_w(s, a)\right] = 0. \qquad (20)$$
The integral in Equation (20) can be set to zero by choosing $A_w$ to solve the differential equation $\nabla_w A_w(s, a) = \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$, which results in the desired compatible features $A_w(s, a) = w^\top \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$. ∎
Note that the features derived in Theorem 4 are analogous to the ones derived in Theorem 1, with the difference that the occupancy measure is taken w.r.t. the data-gathering policy $\pi_\theta$ in the case of Theorem 4.
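The analogous check for Theorem 4, again on a single-state softmax sketch with hypothetical advantages: a regression weighted by $\pi_{\theta'} / \pi_\theta$ under $\pi_\theta$-sampling is equivalent to an ordinary regression under $\pi_{\theta'}$, and the resulting critic leaves the surrogate gradient unchanged:

```python
import numpy as np

n_actions = 3
theta_old = np.array([0.1, 0.0, -0.2])   # pi_theta (data-gathering)
theta_new = np.array([0.5, -0.1, 0.0])   # pi_theta'
adv = np.array([0.7, -0.2, -0.5])        # hypothetical advantages

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(x, a):
    g = -softmax(x)
    g[a] += 1.0
    return g

p_old, p_new = softmax(theta_old), softmax(theta_new)
phi = np.stack([score(theta_new, a) for a in range(n_actions)])  # Theorem 4

# Weighted regression under pi_old with weights pi_new/pi_old reduces to an
# ordinary regression under pi_new: use pi_new as the weighting measure.
W = np.diag(p_new)
w = np.linalg.pinv(phi.T @ W @ phi) @ phi.T @ W @ adv

# Importance-sampled surrogate gradient with true advantages vs. the critic.
ratio = p_new / p_old
g_true = sum(p_old[a] * ratio[a] * phi[a] * adv[a] for a in range(n_actions))
g_critic = sum(p_old[a] * ratio[a] * phi[a] * (phi[a] @ w)
               for a in range(n_actions))
```

As in the Theorem 3 check, the normal equations of the weighted fit force the bias term to vanish, so `g_critic` matches `g_true` exactly.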
Assumption (17) in Theorem 4 has a natural interpretation. Intuitively, the critic values should be more accurate for actions with high probability under policy $\pi_{\theta'}$. More formally, the condition in (17) can be satisfied by finding a parameter $w$ that minimises the weighted quadratic error $\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \left(A_w(s, a) - A^{\pi_\theta}(s, a)\right)^2\right]$, where the weights $\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}$ ensure improved accuracy for actions that are likely under the target policy $\pi_{\theta'}$. Similarly, when the target policy assigns low probability to an action, the error in the critic estimation is not relevant, as it will not influence the gradient $\nabla_{\theta'} L_\theta(\theta')$.

5 Experiments
In this section we use the NChain environment Strens (2000) to compare the gradient $\nabla_{\theta'} L_\theta(\theta')$ calculated in two different ways: i) using a standard linear critic learned with the standard least squares regression approach given by (4); and ii) using a linear critic with the compatible features given by (18), learned by solving the weighted least squares regression problem. We use a parametric policy for which the proposed compatible features $\nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$ can be computed in closed form. We provide the true state-action value function as targets to learn the critics, and estimate the gradient using rollouts from policy $\pi_\theta$ with fixed policy parameters $\theta$ and $\theta'$.
We compare the obtained gradients to the ground truth calculated using the true state-action value function. We provide an increasing number of rollouts from policy $\pi_\theta$ to both approaches and analyse the errors in estimation. We report the bias, variance and RMSE of the estimated gradients in Figure 1.
As expected, the bias in the estimation of $\nabla_{\theta'} L_\theta(\theta')$ introduced by the standard critic cannot be removed by increasing the number of provided trajectories, as it is caused by employing a non-compatible function approximator. The derived compatible features are unbiased by construction and hence provide a substantially improved quality of estimation of $\nabla_{\theta'} L_\theta(\theta')$. Interestingly, using compatible features also provides slightly lower variance than using the standard critic.
6 Conclusions
We have systematically analysed the use of parametric critics to estimate the policy optimization surrogate objective. As a consequence, we can provide conditions under which the approximation does not introduce bias into the policy updates. This result holds for a general class of policies, including policies parametrized by deep neural networks. We have shown that for the investigated surrogate objective there exist two different choices of compatible features. We empirically demonstrated that the compatible features allow estimating the gradient of the surrogate objective more accurately.
References
[1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 22–31.

[2] K. Ciosek and S. Whiteson (2018). Expected policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.

[3] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016). Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 1329–1338.

[4] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2017). Deep reinforcement learning that matters. CoRR abs/1709.06560.

[5] S. Kakade and J. Langford (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, pp. 267–274.

[6] S. Kakade (2001). A natural policy gradient. In Advances in Neural Information Processing Systems 14, NIPS'01, pp. 1531–1538.

[7] V. R. Konda and J. N. Tsitsiklis (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), pp. 1143–1166.

[8] J. Pajarinen et al. (2019). Compatible natural gradient policy search. CoRR abs/1902.02823.

[9] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, pp. 1889–1897.

[10] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. CoRR abs/1707.06347.

[12] M. Strens (2000). A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pp. 943–950.

[13] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, NIPS'99, pp. 1057–1063.