Preferential Temporal Difference Learning

06/11/2021
by   Nishanth Anand, et al.
17

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.

READ FULL TEXT
research
08/15/2020

Reducing Sampling Error in Batch Temporal Difference Learning

Temporal difference (TD) learning is one of the main foundations of mode...
research
01/30/2023

On the Statistical Benefits of Temporal Difference Learning

Given a dataset on actions and resulting long-term rewards, a direct est...
research
10/13/2021

PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method

Emphatic temporal difference (ETD) learning (Sutton et al., 2016) is a s...
research
01/25/2018

Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

This paper investigates estimating the variance of a temporal-difference...
research
11/22/2021

Optimistic Temporal Difference Learning for 2048

Temporal difference (TD) learning and its variants, such as multistage T...
research
06/16/2020

META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning

Temporal-Difference (TD) learning is a standard and very successful rein...
research
04/08/2022

Approximate discounting-free policy evaluation from transient and recurrent states

In order to distinguish policies that prescribe good from bad actions in...

Please sign up or login with your details

Forgot password? Click here to reset