Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

01/26/2023
by Brett Daley, et al.

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across λ-values in an off-policy control task.
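To make the per-decision mechanism concrete, here is a minimal tabular sketch of off-policy TD(λ) prediction with eligibility traces, where each step's traces are decayed by the instantaneous IS ratio, truncated at 1 in the style of Retrace to control variance. The function and variable names, and the specific clipping choice, are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def offpolicy_td_lambda_episode(V, trajectory, gamma=0.99, lam=0.9, alpha=0.1):
    """Run one episode of tabular off-policy TD(lambda) prediction.

    `trajectory` is a list of (s, a, r, s_next, pi_prob, mu_prob) tuples,
    where pi_prob and mu_prob are the target- and behavior-policy
    probabilities of the action actually taken. (Illustrative sketch,
    not the paper's method.)
    """
    e = np.zeros_like(V)  # one eligibility trace per state
    for s, a, r, s_next, pi_prob, mu_prob in trajectory:
        rho = pi_prob / mu_prob               # instantaneous IS ratio
        c = lam * min(1.0, rho)               # cut the ratio to tame IS variance
        e *= gamma * c                        # decay (and possibly cut) all traces
        e[s] += 1.0                           # accumulate trace for current state
        delta = r + gamma * V[s_next] - V[s]  # temporal-difference error
        V += alpha * delta * e                # re-weight past TD credit via traces
    return V
```

Note that once a run of small clipped ratios drives `e` to zero, no later step can restore it; this is the irreversibility of trace cutting that the abstract highlights, and the motivation for trajectory-aware alternatives such as RBIS.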


Related research

- Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions (12/23/2021)
  Off-policy learning from multistep returns is crucial for sample-efficie...

- SOPE: Spectrum of Off-Policy Estimators (11/06/2021)
  Many sequential decision making problems are high-stakes and require off...

- TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning (05/17/2019)
  Off-policy reinforcement learning with eligibility traces is challenging...

- Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning (07/21/2023)
  Oftentimes, environments for sequential decision-making problems can be ...

- Conditional Importance Sampling for Off-Policy Learning (10/16/2019)
  The principal contribution of this paper is a conceptual framework for o...

- Lifelong Hyper-Policy Optimization with Multiple Importance Sampling Regularization (12/13/2021)
  Learning in a lifelong setting, where the dynamics continually evolve, i...

- Investigating Recurrence and Eligibility Traces in Deep Q-Networks (04/18/2017)
  Eligibility traces in reinforcement learning are used as a bias-variance...
