Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic

06/05/2023
by Tianying Ji, et al.

Learning high-quality Q-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms. Previous works mainly address the value overestimation issue, an outcome of adopting function approximators and off-policy learning. Deviating from this common viewpoint, we observe that Q-values are in fact underestimated in the later stage of RL training, primarily because Bellman updates bootstrap from inferior actions of the current policy rather than the better action samples already stored in the replay buffer. We hypothesize that this long-neglected phenomenon hinders policy learning and reduces sample efficiency. Our insight is to exploit past successes sufficiently while maintaining exploration optimism. We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates Q-values using both the historical best-performing actions and the current policy. Instantiations of our method in both model-free and model-based settings outperform state-of-the-art methods on various continuous control tasks and achieve strong performance in failure-prone scenarios and real-world robot tasks.
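The abstract does not spell out the update rule, but the described blending can be sketched as a convex mix of two Bellman targets: one that bootstraps from the best-performing action found in the replay buffer (exploitation) and one that bootstraps from an action drawn from the current policy (exploration). The snippet below is a minimal illustration of that idea, not the paper's implementation; the function name bee_target, the mixing coefficient lam, and the way the best in-buffer Q-value is obtained are all assumptions for the sake of the example.

```python
import numpy as np

def bee_target(r, q_next_policy, q_next_buffer_best, gamma=0.99, lam=0.5):
    """Sketch of a blended exploitation/exploration Bellman target.

    r                  : rewards of the sampled transitions, shape (batch,)
    q_next_policy      : Q(s', a') with a' sampled from the current policy
                         (exploration term)
    q_next_buffer_best : Q(s', a*) with a* the best-performing action for s'
                         taken from the replay buffer (exploitation term)
    lam                : hypothetical mixing coefficient between the two terms
    """
    exploit = r + gamma * q_next_buffer_best   # bootstrap from past successes
    explore = r + gamma * q_next_policy        # standard current-policy bootstrap
    return lam * exploit + (1.0 - lam) * explore

# toy usage with random placeholder values
rng = np.random.default_rng(0)
batch = 4
target = bee_target(rng.normal(size=batch),
                    rng.normal(size=batch),
                    rng.normal(size=batch))
print(target)
```

In a full implementation the exploitation term would have to be estimated from actions actually stored in the buffer (for example via an in-sample maximum), and the mixing weight could be fixed or annealed over training; both details are left abstract in this sketch.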


Related research

03/11/2019  Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
Value-based reinforcement-learning algorithms are currently state-of-the...

06/02/2020  Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration
Policy entropy regularization is commonly used for better exploration in...

06/10/2019  Boosting Soft Actor-Critic: Emphasizing Recent Experience without Forgetting the Past
Soft Actor-Critic (SAC) is an off-policy actor-critic deep reinforcement...

07/25/2022  Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy
Model-based reinforcement learning (RL) achieves higher sample efficienc...

07/16/2018  Remember and Forget for Experience Replay
Experience replay (ER) is crucial for attaining high data-efficiency in ...

06/09/2021  Bayesian Bellman Operators
We introduce a novel perspective on Bayesian reinforcement learning (RL)...

10/02/2018  Sparse Gaussian Process Temporal Difference Learning for Marine Robot Navigation
We present a method for Temporal Difference (TD) learning that addresses...
