A Tale of Sampling and Estimation in Discounted Reinforcement Learning

04/11/2023
by Alberto Maria Metelli, et al.

The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to perform it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, including the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process has compelling statistical properties compared with the alternative estimators, as it matches the lower bound without requiring careful tuning of the episode horizon.
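The contrast between the two sampling procedures can be illustrated with a minimal sketch. It is not the paper's code: the two-state chain, the function names, and the parameter values are all hypothetical, and "discounted kernel" is read here as the kernel that follows the transition matrix P with probability gamma and restarts from the initial distribution nu with probability 1 - gamma. Under that assumption, the stationary distribution of the restarted chain equals the normalized discounted state distribution (1 - gamma) * nu^T (I - gamma P)^{-1}.

```python
import numpy as np

def discounted_mean_exact(P, nu, f, gamma):
    """Closed form: (1 - gamma) * nu^T (I - gamma P)^{-1} f."""
    n = P.shape[0]
    return (1 - gamma) * nu @ np.linalg.solve(np.eye(n) - gamma * P, f)

def finite_horizon_estimate(P, nu, f, gamma, horizon, n_episodes, rng):
    """Average of truncated, normalized discounted sums over independent episodes.
    Truncation at `horizon` drops the tail of the discounted sum (a bias source)."""
    n = P.shape[0]
    total = 0.0
    for _ in range(n_episodes):
        s = rng.choice(n, p=nu)              # each episode restarts from nu
        for t in range(horizon):
            total += (1 - gamma) * gamma**t * f[s]
            s = rng.choice(n, p=P[s])
    return total / n_episodes

def discounted_kernel_estimate(P, nu, f, gamma, n_samples, rng):
    """Plain sample mean of f along one chain driven by the assumed discounted
    kernel: follow P with probability gamma, restart from nu otherwise."""
    n = P.shape[0]
    s = rng.choice(n, p=nu)
    total = 0.0
    for _ in range(n_samples):
        total += f[s]
        if rng.random() < gamma:
            s = rng.choice(n, p=P[s])        # ordinary transition
        else:
            s = rng.choice(n, p=nu)          # reset to the initial distribution
    return total / n_samples

# Hypothetical two-state Markov reward process used only for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
nu = np.array([1.0, 0.0])    # initial state distribution
f = np.array([0.0, 1.0])     # function (e.g. reward) whose discounted mean we want
gamma = 0.9

rng = np.random.default_rng(0)
exact = discounted_mean_exact(P, nu, f, gamma)
fh = finite_horizon_estimate(P, nu, f, gamma, horizon=50, n_episodes=1000, rng=rng)
dk = discounted_kernel_estimate(P, nu, f, gamma, n_samples=20000, rng=rng)
print(f"exact={exact:.4f}  finite-horizon={fh:.4f}  discounted-kernel={dk:.4f}")
```

In this sketch the finite-horizon estimator needs a horizon to be chosen, which trades truncation bias against samples spent per episode, whereas the restarted chain has no such parameter: the episode lengths are set implicitly by the resets.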

research
01/29/2020

Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

We consider the problem of off-policy evaluation for reinforcement learn...
research
09/13/2021

Theoretical Guarantees of Fictitious Discount Algorithms for Episodic Reinforcement Learning and Global Convergence of Policy Gradient Methods

When designing algorithms for finite-time-horizon episodic reinforcement...
research
02/10/2020

Statistically Efficient Off-Policy Policy Gradients

Policy gradient methods in reinforcement learning update policy paramete...
research
10/29/2018

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

We consider the off-policy estimation problem of estimating the expected...
research
02/10/2022

Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory

Off-Policy Evaluation (OPE) serves as one of the cornerstones in Reinfor...
research
05/23/2019

Average reward reinforcement learning with unknown mixing times

We derive and analyze learning algorithms for policy evaluation, apprent...
research
02/19/2023

Distributional Offline Policy Evaluation with Predictive Error Guarantees

We study the problem of estimating the distribution of the return of a p...
