Reducing Sampling Error in Batch Temporal Difference Learning

08/15/2020
by Brahma Pavse, et al.

Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch, not the true probability of the action under the given policy. To address this limitation, we introduce policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates the empirical distribution of actions in each state in the batch and then uses importance sampling to correct for the mismatch between the empirical weighting and the correct weighting for updates following each action. We refine the concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more data-efficient estimator than TD(0) for a fixed batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0).
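To make the correction concrete, below is a minimal tabular, on-policy sketch of the idea described in the abstract: estimate the empirical action distribution in the batch, then reweight each TD(0) update by the ratio of the true action probability to the empirical one. The function name, the transition-tuple format, and the hyperparameters (alpha, gamma, a fixed number of sweeps standing in for running updates to convergence) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from collections import defaultdict

def psec_td0(batch, pi, n_states, alpha=0.1, gamma=0.99, n_sweeps=500):
    """Sketch of tabular batch value estimation with a policy sampling error correction.

    batch : list of (s, a, r, s_next, done) tuples collected by the evaluation policy
    pi    : pi[s][a] = true probability of action a in state s under that policy
    """
    # Step 1: maximum-likelihood estimate of the empirical policy in the batch.
    sa_counts = defaultdict(lambda: defaultdict(int))
    s_counts = defaultdict(int)
    for s, a, _, _, _ in batch:
        sa_counts[s][a] += 1
        s_counts[s] += 1

    def pi_hat(s, a):
        # Fraction of visits to state s in which action a was actually taken.
        return sa_counts[s][a] / s_counts[s]

    # Step 2: repeated TD(0) sweeps over the batch, with each update reweighted
    # by the importance ratio pi(a|s) / pi_hat(a|s). Setting the ratio to 1
    # recovers ordinary batch TD(0).
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s, a, r, s_next, done in batch:
            rho = pi[s][a] / pi_hat(s, a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * rho * (target - V[s])
    return V
```

Intuitively, the ratio up-weights updates for actions the policy chooses more often than they happened to appear in the batch and down-weights over-represented ones, so the learned values are pulled toward the value function of the given policy rather than of the batch's empirical policy.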

Related research

04/17/2017 · O^2TD: (Near)-Optimal Off-Policy TD Learning
Temporal difference learning and Residual Gradient methods are the most ...

06/11/2021 · Preferential Temporal Difference Learning
Temporal-Difference (TD) learning is a general and very useful tool for ...

07/04/2021 · Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
Tree Search (TS) is crucial to some of the most influential successes in...

09/26/2013 · Batch-iFDD for Representation Expansion in Large MDPs
Matching pursuit (MP) methods are a promising class of feature construct...

01/28/2022 · Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error
In this work, we study the use of the Bellman equation as a surrogate ob...

03/09/2020 · Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison
We prove performance guarantees of two algorithms for approximating Q* i...

01/30/2023 · On the Statistical Benefits of Temporal Difference Learning
Given a dataset on actions and resulting long-term rewards, a direct est...
