Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

05/02/2018
by   Yann Ollivier, et al.
In reinforcement learning, temporal difference (TD) learning is the most direct algorithm for learning the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must be approximated by a function in some parametric family. However, with nonlinear parametric approximations (such as neural networks), TD is not guaranteed to converge to a good approximation of the true value function within the family, and is known to diverge even in relatively simple cases. TD lacks an interpretation as a stochastic gradient descent of an error between the true and approximate value functions, which would provide such guarantees. We prove that approximate TD is a gradient descent provided the current policy is reversible. This holds even with nonlinear approximations. A policy with transition probabilities P(s,s') between states is reversible if there exists a function μ over states such that P(s,s')/P(s',s)=μ(s')/μ(s). In particular, every move can be undone with some probability. This condition is restrictive; it is satisfied, for instance, by navigation problems on any undirected graph. In that case, approximate TD is exactly a gradient descent of the Dirichlet norm of the difference between the true and approximate value functions, i.e., the norm of the difference of their gradients. The Dirichlet norm also controls the bias of approximate policy gradient methods. These results hold even with no discount factor (γ=1) and do not rely on contractivity of the Bellman operator, thus establishing the stability of TD with γ=1 for reversible policies.
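The setting is easy to illustrate numerically. Below is a minimal sketch (Python with NumPy; this is not code from the paper, and all names are illustrative choices): a uniform random walk on an undirected cycle graph, which satisfies the detailed-balance condition above with μ(s) proportional to the degree of s; a TD(0) update on a linear parametric family; and the Dirichlet norm of the error between the learned and true value functions. A discount γ = 0.95 is used here so that the true value function can be obtained by a direct linear solve; the paper's results also cover γ = 1.

```python
# Sketch only: TD(0) with function approximation on a reversible policy
# (uniform random walk on an undirected cycle graph).
import numpy as np

rng = np.random.default_rng(0)

# Undirected cycle with n states. The uniform random walk on a regular
# undirected graph satisfies detailed balance, mu(s) P(s,s') = mu(s') P(s,s),
# with mu proportional to the degree, so the policy is reversible.
n = 12
P = np.zeros((n, n))
for s in range(n):
    for t in ((s - 1) % n, (s + 1) % n):
        P[s, t] = 0.5

gamma = 0.95                      # discount; the paper also covers gamma = 1
r = rng.normal(size=n)            # fixed reward r(s) received when leaving s

# True value function: V = r + gamma * P V, i.e. V = (I - gamma P)^{-1} r.
V_true = np.linalg.solve(np.eye(n) - gamma * P, r)

# Linear approximation V_theta(s) = phi(s) . theta with random features.
k = 4
phi = rng.normal(size=(n, k))

def v(theta):
    return phi @ theta

# TD(0): theta += lr * delta * grad_theta V_theta(s),
# with delta = r(s) + gamma * V_theta(s') - V_theta(s).
theta = np.zeros(k)
s = 0
for step in range(100_000):
    lr = 0.05 / (1 + step / 10_000)       # slowly decaying step size
    s_next = rng.choice(n, p=P[s])
    delta = r[s] + gamma * v(theta)[s_next] - v(theta)[s]
    theta += lr * delta * phi[s]          # gradient of phi(s).theta is phi(s)
    s = s_next

def dirichlet_norm_sq(f):
    # Dirichlet (semi-)norm: (1/2) sum over s,s' of mu(s) P(s,s') (f(s')-f(s))^2,
    # the norm of the "gradient" of f along transitions of the chain.
    mu = np.full(n, 1.0 / n)              # stationary distribution (uniform here)
    diffs = f[None, :] - f[:, None]       # entry (s, s') is f(s') - f(s)
    return float(np.sum(mu[:, None] * P * diffs**2) / 2)

err = v(theta) - V_true
print("Dirichlet norm of (V_theta - V_true):", dirichlet_norm_sq(err) ** 0.5)
```

With linear features, on-policy TD(0) is known to converge; the reversible case is what extends the gradient-descent interpretation to nonlinear families. In this sketch, replacing `phi @ theta` by a small neural network and `phi[s]` by that network's gradient at state s gives the corresponding nonlinear update.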


Related research

09/10/2022 · Gradient Descent Temporal Difference-difference Learning
Off-policy algorithms, in which a behavior policy differs from the targe...

01/18/2021 · Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint
In reinforcement learning, temporal difference-based algorithms can be s...

07/09/2018 · Temporal Difference Learning with Neural Networks - Study of the Leakage Propagation Problem
Temporal-Difference learning (TD) [Sutton, 1988] with function approxima...

06/15/2017 · Reinforcement Learning under Model Mismatch
We study reinforcement learning under model misspecification, where we d...

05/29/2019 · On the Expected Dynamics of Nonlinear TD Learning
While there are convergence guarantees for temporal difference (TD) lear...

12/03/2022 · Smoothing Policy Iteration for Zero-sum Markov Games
Zero-sum Markov Games (MGs) have been an efficient framework for multi-ag...

04/08/2022 · Approximate discounting-free policy evaluation from transient and recurrent states
In order to distinguish policies that prescribe good from bad actions in...
