Merging Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff for Effective Deep Reinforcement Learning

11/24/2019
by Gang Chen, et al.

Deep reinforcement learning (DRL) on Markov decision processes (MDPs) with continuous action spaces is often approached by directly updating parametric policies along the direction of estimated policy gradients (PGs). Previous research revealed that the performance of these PG algorithms depends heavily on the bias-variance tradeoff involved in estimating and using PGs. A notable approach to balancing this tradeoff is to merge on-policy and off-policy gradient estimations for training stochastic policies. However, this method cannot be used directly by sample-efficient off-policy PG algorithms such as Deep Deterministic Policy Gradient (DDPG) and Twin-Delayed DDPG (TD3), which are designed to train deterministic policies. It is therefore important to develop new techniques for merging multiple off-policy estimations of the deterministic PG (DPG). Driven by this research question, this paper introduces elite DPG, which is estimated differently from conventional DPG so as to emphasize variance reduction at the expense of increased learning bias. To mitigate the extra bias, policy consolidation techniques are developed to distill policy behavioral knowledge from elite trajectories and use the distilled generative model to further regularize policy training. Moreover, we study, both theoretically and experimentally, two DPG merging methods, interpolation merging and two-step merging, which induce varied bias-variance tradeoffs through the combined use of conventional DPG and elite DPG. Experiments on six benchmark control tasks confirm that both merging methods noticeably improve the learning performance of TD3, significantly outperforming several state-of-the-art DRL algorithms.
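To make these ideas concrete, below is a minimal sketch of what the two merging schemes and the consolidation regularizer might look like, assuming the two DPG estimates are available as flat gradient vectors. The names dpg_conventional and dpg_elite, the interpolation weight lam, the penalty coefficient beta, and the sequential reading of two-step merging are illustrative assumptions, not the paper's actual formulation.

import torch

def interpolation_merge(dpg_conventional: torch.Tensor,
                        dpg_elite: torch.Tensor,
                        lam: float = 0.5) -> torch.Tensor:
    # Convex combination of the two estimates: larger lam leans on the
    # elite DPG (lower variance, higher bias), smaller lam on the
    # conventional DPG (higher variance, lower bias).
    return lam * dpg_elite + (1.0 - lam) * dpg_conventional

def two_step_merge(actor_params: torch.Tensor,
                   dpg_conventional: torch.Tensor,
                   dpg_elite: torch.Tensor,
                   lr: float = 1e-3) -> torch.Tensor:
    # One plausible reading of "two-step" merging: apply the two
    # estimates in sequence rather than blending them into one update.
    actor_params = actor_params + lr * dpg_elite
    actor_params = actor_params + lr * dpg_conventional
    return actor_params

def consolidated_actor_loss(q_value: torch.Tensor,
                            action: torch.Tensor,
                            distilled_action: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    # One way to "further regularize policy training": penalize the actor
    # for drifting from the behavior produced by an assumed generative
    # model distilled from elite trajectories (distilled_action).
    return -q_value.mean() + beta * torch.mean((action - distilled_action) ** 2)

# Example: merge two toy gradient estimates for a 4-parameter policy.
g_conv = torch.randn(4)
g_elite = torch.randn(4)
merged = interpolation_merge(g_conv, g_elite, lam=0.3)

In both sketches, the tunable quantity (the weight lam, or the ordering and step sizes of the two sequential updates) is what shifts the combined estimator along the bias-variance spectrum.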
