When is Realizability Sufficient for Off-Policy Reinforcement Learning?

11/10/2022
by Andrea Zanette, et al.

Model-free algorithms for reinforcement learning typically require a condition called Bellman completeness in order to operate successfully off-policy with function approximation, unless additional conditions are met. However, Bellman completeness is a requirement that is much stronger than realizability and is generally considered too strong to hold in practice. In this work, we relax this structural assumption and analyze the statistical complexity of off-policy reinforcement learning when only realizability holds for the prescribed function class. We establish finite-sample guarantees for off-policy reinforcement learning that are free of the approximation error term known as inherent Bellman error, and that depend on the interplay of three factors. The first two are well-known: the metric entropy of the function class and the concentrability coefficient that represents the cost of learning off-policy. The third factor is new, and it measures the violation of Bellman completeness, namely the misalignment between the chosen function class and its image through the Bellman operator. In essence, these error bounds establish that off-policy reinforcement learning remains statistically viable even in the absence of Bellman completeness, and characterize the intermediate situation between the favorable Bellman complete setting and the worst-case scenario where exponential lower bounds are in force. Our analysis directly applies to the solution found by temporal difference algorithms when they converge.
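
For context, the two conditions contrasted in the abstract, together with the inherent Bellman error it mentions, can be sketched in standard notation. This is a minimal sketch: the symbols F, T, and Q* are generic conventions assumed here rather than taken from the paper, and the paper's misalignment factor is a quantity distinct from the inherent Bellman error shown last.

% Sketch of the standard conditions (generic notation; an assumption, not the paper's definitions)
% F  : the prescribed function class of candidate Q-functions
% T  : the Bellman operator
% Q* : the target Q-function
\[
  \text{Realizability:}\qquad Q^{\star} \in \mathcal{F}
\]
\[
  \text{Bellman completeness:}\qquad \mathcal{T} f \in \mathcal{F} \quad \text{for all } f \in \mathcal{F}
\]
\[
  \text{Inherent Bellman error:}\qquad \sup_{f \in \mathcal{F}} \; \inf_{g \in \mathcal{F}} \; \lVert g - \mathcal{T} f \rVert
\]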


