Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint

01/18/2021
by Léonard Blier, et al.

In reinforcement learning, temporal-difference-based algorithms can be sample-inefficient: for instance, with sparse rewards, no learning occurs until a reward is observed. This can be remedied by learning richer objects, such as a model of the environment or successor states. Successor states model the expected future state occupancy from any given state for a given policy and are related to goal-dependent value functions, which learn how to reach arbitrary states. We formally derive the temporal difference algorithm for successor state and goal-dependent value function learning, both for discrete environments and for continuous environments with function approximation. In particular, we provide finite-variance estimators even in continuous environments, where the reward for exactly reaching a goal state becomes infinitely sparse. Successor states satisfy more than just the Bellman equation: a backward Bellman operator and a Bellman-Newton (BN) operator encode path compositionality in the environment. The BN operator is akin to second-order gradient descent methods and provides the true update of the value function when acquiring more observations, with explicit tabular bounds. In the tabular case and with infinitesimal learning rates, mixing the usual and backward Bellman operators provably improves the eigenvalues governing asymptotic convergence, and the asymptotic convergence of the BN operator is provably better than that of TD, with a rate independent of the environment. However, the BN method is more complex and less robust to sampling noise. Finally, a forward-backward (FB) finite-rank parameterization of successor states enjoys reduced variance and improved samplability, provides a direct model of the value function, has fully understood fixed points corresponding to long-range dependencies, approximates the BN method, and provides two canonical representations of states as a byproduct.
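To make the learned object concrete: in a finite environment, the successor-state matrix of a policy collects the expected discounted occupancies M(s, s') = Σ_{t≥0} γ^t P(s_t = s' | s_0 = s). It satisfies the Bellman equation M = I + γPM, with exact solution M = (I - γP)^{-1}; the BN operator discussed above corresponds, roughly speaking, to a Newton-type iteration for this matrix inverse. Below is a minimal sketch, not taken from the paper, of the plain tabular temporal-difference update for M; the function name td_successor_states, the toy transition data, and the hyperparameters are illustrative assumptions, and none of the paper's variance-reduction machinery is included. Once M is estimated, any reward vector r immediately yields a value function V = Mr, which is the sample-efficiency argument made in the abstract.

import numpy as np

def td_successor_states(transitions, n_states, gamma=0.9, lr=0.1, n_epochs=2000):
    """Tabular TD for the successor-state matrix M, convention M = I + gamma P M.

    transitions: list of observed (s, s_next) pairs collected under the policy.
    """
    M = np.eye(n_states)  # start from the identity (the gamma = 0 solution)
    for _ in range(n_epochs):
        for s, s_next in transitions:
            one_hot = np.zeros(n_states)
            one_hot[s] = 1.0
            # TD target for the whole row M(s, .): being at s now, plus the
            # discounted occupancies bootstrapped from the next state.
            target = one_hot + gamma * M[s_next]
            M[s] += lr * (target - M[s])
    return M

if __name__ == "__main__":
    # Toy deterministic 3-state cycle 0 -> 1 -> 2 -> 0, for illustration only.
    transitions = [(0, 1), (1, 2), (2, 0)]
    M = td_successor_states(transitions, n_states=3)
    r = np.array([0.0, 0.0, 1.0])  # sparse reward at state 2
    V = M @ r                      # values for this reward come for free: V = M r
    print(V)                       # close to (I - 0.9 P)^{-1} r

In the same spirit, the FB parameterization mentioned above replaces the full matrix by a low-rank factorization M(s, s') ≈ F(s)^T B(s'), so that V(s) ≈ F(s)^T Σ_{s'} B(s') r(s') gives a direct model of the value function.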


