Separating value functions across time-scales

02/05/2019
by Joshua Romoff, et al.

In many finite-horizon episodic reinforcement learning (RL) settings, it is desirable to optimize for the undiscounted return: in Atari, for instance, the goal is to collect the most points while staying alive in the long run. Yet it may be difficult, or even mathematically intractable, to learn with this target. Temporal discounting is therefore often applied to optimize over a shorter effective planning horizon, at the cost of potentially biasing the optimization target away from the undiscounted goal. In settings where this bias is unacceptable, where the system must optimize for longer horizons at higher discounts, the target of the value function approximator may increase in variance, leading to difficulties in learning. We present an extension of temporal difference (TD) learning, which we call TD(Δ), that breaks down a value function into a series of components based on the differences between value functions with smaller discount factors. Separating a longer-horizon value function into these components has useful properties for scalability and performance. We discuss these properties and show theoretical and empirical improvements over standard TD learning in certain settings.
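To make the decomposition concrete, here is a minimal tabular sketch of the idea (an illustration under our own assumptions, not the authors' released code). Given an increasing sequence of discount factors γ_0 < γ_1 < … < γ_Z = γ, define the delta functions W_0 = V_{γ_0} and W_z = V_{γ_z} − V_{γ_{z−1}}, so that V_γ(s) = Σ_z W_z(s). Subtracting the Bellman equations of adjacent value functions gives each component its own Bellman equation, W_z(s) = E[(γ_z − γ_{z−1}) V_{γ_{z−1}}(s') + γ_z W_z(s')], which can be learned with a TD(0)-style update; the reward appears only in the W_0 target, since it cancels in the differences. The function name and tabular setup below are hypothetical.

    import numpy as np

    def td_delta_update(W, gammas, s, r, s_next, alpha=0.1):
        # W is a (Z+1) x num_states array; W[z][s] estimates the delta
        # component W_z(s), with W_0 = V_{gamma_0}.
        # Base component: an ordinary TD(0) update at the smallest discount.
        W[0][s] += alpha * (r + gammas[0] * W[0][s_next] - W[0][s])
        # v_prev accumulates V_{gamma_{z-1}}(s') as the partial sum of components.
        v_prev = W[0][s_next]
        for z in range(1, len(gammas)):
            # Delta-function Bellman target:
            # (gamma_z - gamma_{z-1}) * V_{gamma_{z-1}}(s') + gamma_z * W_z(s')
            target = (gammas[z] - gammas[z - 1]) * v_prev + gammas[z] * W[z][s_next]
            W[z][s] += alpha * (target - W[z][s])
            v_prev += W[z][s_next]
        # The estimate of V_gamma(s) at the largest discount is the sum of components.
        return W[:, s].sum()

Because each W_z bootstraps at its own discount γ_z, the lower components are learned over short effective horizons, while higher components only capture the residual long-horizon signal; summing W[z][s] over z recovers the estimate of V_γ(s).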


Related research

Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning (09/09/2019)

The Value of Planning for Infinite-Horizon Model Predictive Control (04/07/2021)

Meta-Gradient Reinforcement Learning (05/24/2018)

Suboptimality analysis of receding horizon quadratic control with unknown linear systems and its applications in learning-based control (01/19/2023)

Predictive State Temporal Difference Learning (10/30/2010)

The Gambler's Problem and Beyond (12/31/2019)

Online Multi-Contact Receding Horizon Planning via Value Function Approximation (06/07/2023)
