
Revisiting Peng's Q(λ) for Modern Reinforcement Learning

by Tadashi Kozuno et al.

Off-policy multi-step reinforcement learning algorithms fall into two classes, conservative and non-conservative: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have limited or no theoretical guarantees. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by these empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q(λ), a representative non-conservative algorithm. We prove that it, too, converges to an optimal policy, provided that the behavior policy slowly tracks a greedy policy in a manner similar to conservative policy iteration. Such a result has long been conjectured but never proven. We also experiment with Peng's Q(λ) on complex continuous control tasks, confirming that Peng's Q(λ) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q(λ), previously thought to be unsafe, is a theoretically sound and practically effective algorithm.
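To make the "non-conservative" idea concrete, the sketch below computes Peng's Q(λ) targets over a single trajectory. It is a minimal illustration, not the authors' implementation; the function name, array layout, and variable names are assumptions. The key point is that the trace-decay factor λ is applied at every step, regardless of whether the sampled action was greedy, so traces are never cut (in contrast to conservative methods such as Watkins' Q(λ) or Retrace).

```python
import numpy as np

def peng_q_lambda_targets(rewards, q_next_max, lam, gamma):
    """Compute Peng's Q(lambda) targets backward over one trajectory.

    rewards:    array of rewards r_t for t = 0..T-1 (illustrative layout).
    q_next_max: array where q_next_max[t] = max_a Q(s_{t+1}, a).
    lam:        trace-decay factor lambda in [0, 1].
    gamma:      discount factor.

    Unlike conservative algorithms, lambda is NOT annealed or zeroed
    at non-greedy actions: the full trace is kept at every step.
    """
    T = len(rewards)
    targets = np.empty(T)
    # Final step bootstraps purely from the greedy value.
    g = rewards[T - 1] + gamma * q_next_max[T - 1]
    targets[T - 1] = g
    for t in range(T - 2, -1, -1):
        # Mix the one-step greedy bootstrap with the longer return.
        g = rewards[t] + gamma * ((1 - lam) * q_next_max[t] + lam * g)
        targets[t] = g
    return targets
```

With λ = 0 this reduces to the one-step Q-learning target r_t + γ max_a Q(s_{t+1}, a); with λ = 1 it becomes an uncorrected multi-step return bootstrapped only at the end of the trajectory.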


