Adaptive Tree Backup Algorithms for Temporal-Difference Reinforcement Learning

06/04/2022
by Brett Daley, et al.

Q(σ) is a recently proposed temporal-difference learning method that interpolates between learning from expected backups and sampled backups. It has been shown that intermediate values of the interpolation parameter σ ∈ [0, 1] perform better in practice, and therefore it is commonly believed that σ functions as a bias-variance trade-off parameter to achieve these improvements. In our work, we disprove this notion, showing that the choice of σ = 0 minimizes variance without increasing bias. This indicates that σ must have some other effect on learning that is not fully understood. As an alternative, we hypothesize the existence of a new trade-off: larger σ-values help overcome poor initializations of the value function, at the expense of higher statistical variance. To automatically balance these considerations, we propose Adaptive Tree Backup (ATB) methods, whose weighted backups evolve as the agent gains experience. Our experiments demonstrate that adaptive strategies can be more effective than relying on fixed or time-annealed σ-values.
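To make the interpolation concrete, the sketch below shows the standard one-step Q(σ) backup target (as in the prior literature the abstract refers to, not the paper's ATB method itself): σ = 1 yields the sampled (Sarsa-style) target, σ = 0 yields the expected (Tree Backup / Expected Sarsa-style) target, and intermediate σ blends the two. The function name and argument layout are illustrative assumptions, not code from the paper.

```python
import numpy as np

def q_sigma_target(r, gamma, q_next, pi_next, a_next, sigma):
    """One-step Q(sigma) backup target.

    r       : immediate reward
    gamma   : discount factor
    q_next  : array of action values Q(s', .) at the next state
    pi_next : target policy probabilities pi(. | s')
    a_next  : index of the sampled next action a'
    sigma   : interpolation parameter in [0, 1]
    """
    expected = np.dot(pi_next, q_next)   # E_pi[Q(s', A)] -- expected backup
    sampled = q_next[a_next]             # Q(s', a')      -- sampled backup
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```

For example, with `q_next = [1.0, 3.0]`, a uniform `pi_next = [0.5, 0.5]`, sampled action `a_next = 1`, `r = 1.0`, and `gamma = 0.5`: σ = 1 gives 1 + 0.5·3 = 2.5, σ = 0 gives 1 + 0.5·2 = 2.0, and σ = 0.5 gives 2.25. The paper's claim is that σ = 0 (the pure expectation) already minimizes the variance of this target without adding bias, so the practical benefit of σ > 0 must come from elsewhere.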


Related research:

- Adaptive Lambda Least-Squares Temporal Difference Learning (12/30/2016)
- Per-decision Multi-step Temporal Difference Learning with Control Variates (07/05/2018)
- PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method (10/13/2021)
- Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods (01/25/2018)
- The Concept of Criticality in Reinforcement Learning (10/16/2018)
- Leveraging the Variance of Return Sequences for Exploration Policy (11/17/2020)
- Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning (11/24/2021)
