Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

06/19/2019
by Hugo Penedones et al.

We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on the trade-offs between Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-offs, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can result from the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
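The per-state switching idea in the abstract can be sketched in a simple tabular setting: estimate each state's value with TD(0), build a confidence interval from the empirical MC returns for that state, and fall back to the MC mean wherever the TD estimate lies outside the interval (a sign of TD bias). This is an illustrative reconstruction, not the authors' algorithm; the function names, the TD(0) solver, and the z-score interval are all assumptions.

```python
import statistics

def mc_returns(trajectories, gamma):
    """First-visit Monte Carlo returns per state.

    Each trajectory is a list of (state, reward) pairs, where the reward
    is received on leaving that state.
    """
    returns = {}
    for traj in trajectories:
        G, rev = 0.0, []
        for s, r in reversed(traj):
            G = r + gamma * G
            rev.append((s, G))
        seen = set()
        for s, G in reversed(rev):      # restore forward order for first-visit
            if s not in seen:
                seen.add(s)
                returns.setdefault(s, []).append(G)
    return returns

def td0(trajectories, gamma, alpha=0.1, sweeps=200):
    """Tabular TD(0) run repeatedly over the batch."""
    V = {}
    for _ in range(sweeps):
        for traj in trajectories:
            for i, (s, r) in enumerate(traj):
                s_next = traj[i + 1][0] if i + 1 < len(traj) else None
                target = r if s_next is None else r + gamma * V.get(s_next, 0.0)
                V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

def adaptive_estimate(trajectories, gamma=0.9, z=2.0):
    """Per state: keep the TD estimate if it falls inside a z-score
    confidence interval around the MC mean; otherwise use the MC mean."""
    V_td = td0(trajectories, gamma)
    V = {}
    for s, Gs in mc_returns(trajectories, gamma).items():
        mean = statistics.fmean(Gs)
        half = (z * statistics.stdev(Gs) / len(Gs) ** 0.5
                if len(Gs) > 1 else float("inf"))
        td = V_td.get(s, mean)
        V[s] = td if abs(td - mean) <= half else mean
    return V
```

On a deterministic two-state chain the MC interval collapses to a point, so any residual TD error triggers the MC fallback; on noisy data with few visits the wide interval leaves the lower-variance TD estimate in place.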


Related research

01/31/2023 · Improving Monte Carlo Evaluation with Offline Data
Monte Carlo (MC) methods are the most widely used methods to estimate th...

07/09/2018 · Temporal Difference Learning with Neural Networks - Study of the Leakage Propagation Problem
Temporal-Difference learning (TD) [Sutton, 1988] with function approxima...

07/27/2020 · Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation
In reinforcement learning, it is typical to use the empirically observed...

12/04/2020 · MCMC Confidence Intervals and Biases
The recent paper "Simple confidence intervals for MCMC without CLTs" by ...

02/27/2020 · ConQUR: Mitigating Delusional Bias in Deep Q-learning
Delusional bias is a fundamental source of error in approximate Q-learni...

02/15/2019 · Monte Carlo Sampling Bias in the Microwave Uncertainty Framework
Uncertainty propagation software can have unknown, inadvertent biases in...

06/20/2016 · Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
For an autonomous agent, executing a poor policy may be costly or even d...
