Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

05/02/2018
by Yann Ollivier, et al.

In reinforcement learning, temporal difference (TD) is the most direct algorithm to learn the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must be approximated by a function in some parametric family. However, with nonlinear parametric approximations (such as neural networks), TD is not guaranteed to converge to a good approximation of the true value function within the family, and is known to diverge even in relatively simple cases. TD lacks an interpretation as a stochastic gradient descent of an error between the true and approximate value functions, which would provide such guarantees.

We prove that approximate TD is a gradient descent provided the current policy is reversible. This holds even with nonlinear approximations. A policy with transition probabilities P(s,s') between states is reversible if there exists a function μ over states such that P(s,s')/P(s',s) = μ(s')/μ(s); in particular, every move can be undone with some probability. This condition is restrictive, but it is satisfied, for instance, in a navigation problem on any unoriented graph.

In that case, approximate TD is exactly a gradient descent of the Dirichlet norm of the difference between the true and approximate value functions, that is, the norm of the gradient of that difference. The Dirichlet norm also controls the bias of approximate policy gradient. These results hold even with no decay factor (γ = 1) and do not rely on contractivity of the Bellman operator, thus proving stability of TD even with γ = 1 for reversible policies.
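
To make the objects in the abstract concrete, here is a minimal NumPy sketch (not taken from the paper): a uniform random walk on an unoriented cycle graph, which satisfies the reversibility condition, a plain TD(0) loop with a parametric value-function approximation, and a Dirichlet (semi-)norm of the resulting error. The cycle graph, the random linear features, the decay factor γ = 0.95, the step size, and the particular μ(s)P(s,s')-weighted form of the Dirichlet norm are illustrative assumptions; the paper's exact statement concerns general differentiable approximations and includes the γ = 1 case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small unoriented graph: a cycle of n states. The uniform random walk on an
# unoriented graph is a reversible policy; here the stationary distribution mu
# is uniform because every state has the same degree.
n = 12
P = np.zeros((n, n))
for s in range(n):
    P[s, (s - 1) % n] = 0.5
    P[s, (s + 1) % n] = 0.5

mu = np.full(n, 1.0 / n)       # stationary distribution of the walk
gamma = 0.95                   # decay factor (illustrative; the paper also covers gamma = 1)
r = rng.normal(size=n)         # arbitrary reward attached to each state

# Reversibility (detailed balance): mu(s) P(s,s') == mu(s') P(s',s),
# equivalently P(s,s')/P(s',s) == mu(s')/mu(s) on all edges.
assert np.allclose(mu[:, None] * P, (mu[:, None] * P).T)

# True value function of the policy: V = (I - gamma P)^{-1} r.
V_true = np.linalg.solve(np.eye(n) - gamma * P, r)

# Parametric approximation V_theta(s) = phi(s) . theta with a few random
# features (a stand-in for any differentiable family such as a neural net).
k = 4
phi = rng.normal(size=(n, k))

def V_approx(theta):
    return phi @ theta

# Plain TD(0): theta <- theta + alpha * delta * grad_theta V_theta(s), where
# delta = r(s) + gamma V_theta(s') - V_theta(s); states are sampled from mu
# and transitions from the policy.
theta = np.zeros(k)
alpha = 0.05
for _ in range(50_000):
    s = rng.choice(n, p=mu)
    s_next = rng.choice(n, p=P[s])
    v = V_approx(theta)
    delta = r[s] + gamma * v[s_next] - v[s]
    theta += alpha * delta * phi[s]   # gradient of phi(s).theta w.r.t. theta

# One common form of the Dirichlet (semi-)norm of a function f over states:
# ||f||_Dir^2 = 1/2 * sum_{s,s'} mu(s) P(s,s') (f(s') - f(s))^2.
def dirichlet_sq(f):
    diff = f[None, :] - f[:, None]
    return 0.5 * np.sum(mu[:, None] * P * diff ** 2)

err = V_true - V_approx(theta)
print("Dirichlet norm^2 of the value-function error:", dirichlet_sq(err))
```

The script only illustrates the setting: a reversible random-walk policy, the TD(0) update with a parametric approximation, and the Dirichlet norm of the approximation error; it is not a reproduction of the paper's experiments or proof.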
