Off-policy Learning with Eligibility Traces: A Survey

04/15/2013
by   Matthieu Geist, et al.
0

In the framework of Markov Decision Processes, off-policy learning, that is the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms - off-policy LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ) - and suggests new extensions - off-policy FPKF(λ), BRM(λ), gBRM(λ), GTD2(λ). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficent form, discuss their known convergence properties and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms on and off-policy LSTD(λ)/LSPE(λ) - and TD(λ) if the feature space dimension is too large for a least-squares approach - perform the best.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/13/2015

Policy Gradient Methods for Off-policy Control

Off-policy learning refers to the problem of learning the value function...
research
05/25/2017

Convergent Tree-Backup and Retrace with Function Approximation

Off-policy learning is key to scaling up reinforcement learning as it al...
research
12/27/2017

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for pol...
research
10/13/2022

Reinforcement Learning with Unbiased Policy Evaluation and Linear Function Approximation

We provide performance guarantees for a variant of simulation-based poli...
research
06/18/2020

Distributed Value Function Approximation for Collaborative Multi-Agent Reinforcement Learning

In this paper we propose novel distributed gradient-based temporal diffe...
research
11/06/2018

Online Off-policy Prediction

This paper investigates the problem of online prediction learning, where...
research
09/16/2022

Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes

Lying on the heart of intelligent decision-making systems, how policy is...

Please sign up or login with your details

Forgot password? Click here to reset