Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning

01/20/2023
by Haoxuan Pan, et al.

We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from a deep reinforcement learning (DRL) perspective. The objective is formulated theoretically as the expected returns discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take the discount factor into account. Existing discussion of this bias in the literature has been limited to the tabular and softmax cases. In this paper, we therefore extend it to the DRL setting, where the policy is parameterized, and demonstrate theoretically how this bias can lead to suboptimal policies. We then discuss why implementations that use the shifted state distribution, although inaccurate, can still be effective in practice. We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically, we show that a smaller learning rate, or an adaptive learning rate such as that used by the Adam and RMSProp optimizers, makes the policy optimization robust to the bias. We further draw connections between optimizers and optimization regularization to show that both KL and reverse KL regularization can significantly rectify this bias. Moreover, we provide extensive experiments on continuous control tasks to support our analysis. Our paper sheds light on how successful policy gradient (PG) algorithms optimize policies in the DRL setting, and contributes insights into practical issues in DRL.
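The state distribution shift described above can be made concrete with a small sketch. It assumes a REINFORCE-style estimator in which each step's score function, grad log pi(a_t | s_t), is multiplied by a scalar weight; the function names `discounted_returns` and `pg_weights` and the toy reward sequence are hypothetical illustrations, not taken from the paper. Under the discounted objective the weight at step t is gamma^t * G_t, whereas the commonly used shifted form drops the gamma^t factor and uses G_t alone.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k>=t} gamma^(k-t) * r_k for every step t."""
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def pg_weights(rewards, gamma, include_state_discount):
    """Per-step weights on grad log pi(a_t | s_t) in a REINFORCE-style estimate.

    include_state_discount=True  -> gamma^t * G_t (matches the discounted objective)
    include_state_discount=False -> G_t only (the shifted state distribution)
    """
    g = discounted_returns(rewards, gamma)
    if include_state_discount:
        return gamma ** np.arange(len(rewards)) * g
    return g

# Hypothetical toy episode: three steps, reward 1 at each step.
rewards = [1.0, 1.0, 1.0]
gamma = 0.5
discounted = pg_weights(rewards, gamma, include_state_discount=True)
shifted = pg_weights(rewards, gamma, include_state_discount=False)
print(discounted)  # [1.75 0.75 0.25]
print(shifted)     # [1.75 1.5  1.  ]
```

The two weightings agree at the first step and diverge afterwards: the shifted form gives later states more influence on the gradient than the discounted objective assigns them.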
