Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixings of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.
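To make the interpolation idea concrete, below is a minimal, hypothetical Python sketch of a surrogate loss whose gradient mixes an on-policy likelihood-ratio estimate with a critic-based (off-policy) estimate through a coefficient nu, using the critic as a control variate. The function name, argument names, and exact weighting are illustrative assumptions based on the abstract's high-level description, not the paper's reference implementation.

```python
# Hypothetical sketch of an interpolated policy gradient surrogate loss.
# Assumes PyTorch tensors; names (nu, critic_advantages, etc.) are illustrative.
import torch


def interpolated_pg_loss(log_probs, advantages, critic_advantages,
                         critic_values_under_pi, nu):
    """Surrogate loss whose gradient interpolates on- and off-policy estimates.

    log_probs:              log pi(a|s) on on-policy samples (differentiable)
    advantages:             Monte Carlo advantage estimates A(s, a)
    critic_advantages:      critic-based estimates A_w(s, a), used as a control variate
    critic_values_under_pi: E_{a ~ pi}[Q_w(s, a)], differentiable w.r.t. policy params
    nu:                     mixing coefficient in [0, 1]
                            (nu = 0 -> pure on-policy likelihood-ratio gradient,
                             nu = 1 -> pure critic-based gradient)
    """
    # Likelihood-ratio term, with nu * critic advantage subtracted as a control variate.
    on_policy_term = (log_probs * (advantages - nu * critic_advantages.detach())).mean()
    # Critic-based term: the gradient flows into the policy through E_pi[Q_w].
    off_policy_term = nu * critic_values_under_pi.mean()
    # Negate so that minimizing this loss ascends the interpolated gradient.
    return -(on_policy_term + off_policy_term)
```

Under this sketch, sweeping nu between 0 and 1 recovers the family of estimators the abstract describes: small nu behaves like a standard on-policy policy gradient with a control variate, while large nu leans on the learned critic and tolerates off-policy data at the cost of the bias the paper bounds.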