Temporal Difference Learning with Neural Networks - Study of the Leakage Propagation Problem

07/09/2018
by Hugo Penedones, et al.

Temporal-Difference learning (TD) [Sutton, 1988] with function approximation can converge to solutions that are worse than those obtained by Monte-Carlo regression, even in the simple case of on-policy evaluation. To better understand the problem, we investigate how approximation errors in areas of sharp discontinuities of the value function are further propagated by bootstrap updates. We show empirical evidence of this leakage propagation, and show analytically that it must occur, in a simple Markov chain, when function approximation errors are present. For reversible policies, the result can be interpreted as the tension between two terms of the loss function that TD minimises, as recently described by [Ollivier, 2018]. We show that the upper bounds from [Tsitsiklis and Van Roy, 1997] hold, but they do not indicate whether leakage propagation occurs or under what conditions. Finally, we test whether the problem could be mitigated with a better state representation, and whether such a representation can be learned in an unsupervised manner, without rewards or privileged information.

