A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

02/09/2018
by Long Yang, et al.

Recently, a new multi-step temporal-difference learning algorithm, called Q(σ), was proposed; it unifies n-step Tree-Backup (when σ=0) and n-step Sarsa (when σ=1) by introducing a sampling parameter σ. However, like other multi-step temporal-difference learning algorithms, Q(σ) requires substantial memory and computation time. The eligibility trace is an important mechanism for transforming off-line updates into efficient on-line ones that consume less memory and computation time. In this paper, we extend the original Q(σ) by combining it with eligibility traces and propose a new algorithm, called Q(σ,λ), where λ is the trace-decay parameter. This idea unifies Sarsa(λ) (when σ=1) and Q^π(λ) (when σ=0). Furthermore, we give an upper error bound for the Q(σ,λ) policy evaluation algorithm and prove that the Q(σ,λ) control algorithm converges to the optimal value function exponentially fast. We also compare it empirically with conventional temporal-difference learning methods. The results show that, with an intermediate value of σ, Q(σ,λ) creates a mixture of the existing algorithms that can learn the optimal value significantly faster than either extreme (σ=0 or σ=1).
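As a quick illustration of the kind of update the abstract describes, the following minimal tabular sketch shows a Q(σ,λ)-style learner: the one-step target mixes the sampled Sarsa backup and the expected Tree-Backup backup through σ, and an eligibility trace spreads each TD error back to previously visited state-action pairs. The environment interface (env.reset()/env.step()), the policy_probs helper, and in particular the σ-dependent trace decay are assumptions made here for illustration, not the authors' exact algorithm; consult the full paper for the precise Q(σ,λ) update and its convergence conditions.

```python
import numpy as np


def q_sigma_lambda_episode(env, Q, policy_probs, sigma=0.5, lam=0.8,
                           gamma=0.99, alpha=0.1, rng=None):
    """Run one episode of a tabular Q(sigma, lambda)-style update (sketch).

    Q            : array of shape [n_states, n_actions], updated in place.
    policy_probs : function mapping a state index to action probabilities.
    sigma = 1.0 gives a Sarsa(lambda)-like update; sigma = 0.0 gives an
    expected (Tree-Backup / Q^pi(lambda))-like update.
    """
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    traces = np.zeros_like(Q)              # eligibility traces e(s, a)

    state = env.reset()                    # assumed Gym-like tabular interface
    action = rng.choice(n_actions, p=policy_probs(state))
    done = False

    while not done:
        next_state, reward, done = env.step(action)

        if done:
            target = reward
        else:
            next_probs = policy_probs(next_state)
            next_action = rng.choice(n_actions, p=next_probs)
            # sigma mixes the sampled (Sarsa) backup with the expected
            # (Tree-Backup) backup at the next state
            sampled = Q[next_state, next_action]
            expected = float(next_probs @ Q[next_state])
            target = reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)

        td_error = target - Q[state, action]

        # accumulate the trace of the visited pair, then apply the TD error
        # to every state-action pair in proportion to its trace
        traces[state, action] += 1.0
        Q += alpha * td_error * traces

        if not done:
            # decay the traces; making the decay depend on sigma and on the
            # probability of the sampled action is an assumption of this sketch
            traces *= gamma * lam * (sigma + (1.0 - sigma) * next_probs[next_action])
            state, action = next_state, next_action

    return Q
```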

Related research:

07/05/2018  Per-decision Multi-step Temporal Difference Learning with Control Variates. Multi-step temporal difference (TD) learning is an important approach in...

03/13/2023  n-Step Temporal Difference Learning with Optimal n. We consider the problem of finding the optimal value of n in the n-step ...

11/05/2017  Double Q(σ) and Q(σ, λ): Unifying Reinforcement Learning Control Algorithms. Temporal-difference (TD) learning is an important field in reinforcement...

05/17/2019  TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning. Off-policy reinforcement learning with eligibility traces is challenging...

06/16/2020  META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning. Temporal-Difference (TD) learning is a standard and very successful rein...

02/08/2019  Source Traces for Temporal Difference Learning. This paper motivates and develops source traces for temporal difference ...

10/13/2021  PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method. Emphatic temporal difference (ETD) learning (Sutton et al., 2016) is a s...
