Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators

06/24/2021
by Zaiwei Chen et al.

In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling: by decoupling learning from data collection, it enables data reuse. Policy evaluation (including multi-step off-policy importance sampling) is known to admit the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted ℓ_p-norm for every p in [1,∞), with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios, and a number of algorithms (e.g., Q^π(λ), Tree-Backup(λ), Retrace(λ), and Q-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds for these algorithms. In particular, we provide the first known finite-sample guarantees for Q^π(λ), Tree-Backup(λ), and Retrace(λ), and improve on the best known bounds for Q-trace from [19]. Moreover, we characterize the bias-variance trade-off exhibited by each of these algorithms.
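For readers who want to see the kind of object being analyzed, the display below is a sketch of how a generalized multi-step off-policy Bellman operator is commonly written (in the style of the Retrace family of operators). The symbols c_i (trace coefficients), ρ_i (importance sampling ratios), μ (behavior policy), π (target policy), and γ (discount factor) are our own notation for illustration and are not taken from the paper itself.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% Sketch of a generalized multi-step off-policy Bellman operator with
% trace coefficients c_i (illustrative notation, not the paper's own).
% Trajectories are generated by the behavior policy \mu; \pi is the
% target policy and \gamma the discount factor. The empty product
% (t = 0) is taken to be 1, so the first term is the one-step TD error.
\[
(\mathcal{T}_c Q)(s,a) \;=\; Q(s,a)
  \;+\; \mathbb{E}_{\mu}\!\left[ \sum_{t \ge 0} \gamma^{t}
      \Big( \prod_{i=1}^{t} c_i \Big)
      \Big( r_t + \gamma\, \mathbb{E}_{a' \sim \pi}\big[ Q(s_{t+1}, a') \big]
            - Q(s_t, a_t) \Big)
      \,\middle|\, s_0 = s,\, a_0 = a \right],
\qquad
\rho_i = \frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}.
\]
\end{document}
```

Under this template, different choices of the trace coefficient recover the algorithms named in the abstract: c_i = λρ_i gives multi-step importance sampling, c_i = λ gives Q^π(λ), c_i = λπ(a_i|s_i) gives Tree-Backup(λ), c_i = λ min(1, ρ_i) gives Retrace(λ), and Q-trace uses separately clipped ratios. Coefficients that follow ρ_i closely keep the fixed point at Q^π but let the product of importance sampling ratios inflate the variance, while clipping or shrinking the coefficients tames the variance at the cost of bias or a shorter effective backup horizon, which is the bias-variance trade-off the abstract refers to.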


Related research

Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation (08/05/2022)
In this work, we study policy-based methods for solving the reinforcemen...

Finite-sample analysis of rotation operator under l_2 norm and l_∞ norm (09/09/2023)
In this article, we consider a special operator called the two-dimension...

Finite-Sample Analysis of Stochastic Approximation Using Smooth Convex Envelopes (02/03/2020)
Stochastic Approximation (SA) is a popular approach for solving fixed po...

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis (09/17/2015)
We consider the off-policy evaluation problem in Markov decision process...

Emphatic TD Bellman Operator is a Contraction (08/14/2015)
Recently, SuttonMW15 introduced the emphatic temporal differences (ETD) ...

Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning (03/24/2020)
Off-policy estimation for long-horizon problems is important in many rea...

A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants (02/02/2021)
This paper develops a unified framework to study finite-sample converge...
