Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning

11/24/2021
by Nicolai Dorka, et al.

Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning are typically prone to an over- or underestimation bias that builds up over time. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC) that uses the most recent high-variance but unbiased on-policy rollouts to alleviate the bias of the low-variance temporal difference targets. We apply ACC to Truncated Quantile Critics, an algorithm for continuous control that regulates the bias with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts this parameter during training, rendering hyperparameter search unnecessary, and sets a new state of the art on the OpenAI Gym continuous control benchmark among all algorithms that do not tune hyperparameters per environment. Additionally, we demonstrate that ACC is quite general by further applying it to TD3, where it also improves performance.
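To make the idea concrete, here is a minimal sketch of the two ingredients the abstract describes: a TQC-style target that drops the largest quantile estimates to control overestimation, and an ACC-style update that adjusts the truncation parameter using unbiased on-policy returns. The function names, the linear update with `step_size`, and the clipping range are illustrative assumptions for this example, not the paper's exact update rule.

```python
import numpy as np

def truncated_quantile_target(quantiles, beta):
    """Build a TQC-style target by dropping the top fraction of quantile
    estimates; a larger beta drops more quantiles (more pessimism).
    The mapping from beta to the number of kept quantiles is an
    illustrative assumption."""
    q = np.sort(np.asarray(quantiles, dtype=float))
    n_keep = max(1, int(round(len(q) * (1.0 - beta))))
    return float(np.mean(q[:n_keep]))

def acc_adjust(beta, q_estimates, mc_returns, step_size=0.05,
               beta_min=0.0, beta_max=0.5):
    """ACC-style calibration step (sketch): if the critic overestimates
    relative to recent unbiased on-policy Monte Carlo returns, increase
    the truncation; if it underestimates, decrease it."""
    # Positive bias => the critic is overestimating on recent rollouts.
    bias = float(np.mean(np.asarray(q_estimates) - np.asarray(mc_returns)))
    return float(np.clip(beta + step_size * bias, beta_min, beta_max))

# Usage example with made-up numbers: the critic's estimates exceed the
# observed returns, so the adjusted beta rises (more quantiles dropped).
beta = acc_adjust(beta=0.1, q_estimates=[5.2, 4.9], mc_returns=[4.0, 4.1])
target = truncated_quantile_target([3.0, 4.0, 5.0, 9.0], beta)
```

Because the on-policy returns are unbiased, repeatedly nudging the parameter in the direction of the measured error drives the low-variance TD targets toward calibration without any per-environment tuning, which is the mechanism the abstract refers to.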


