Value function estimation in Markov reward processes: Instance-dependent ℓ_∞-bounds for policy evaluation

09/19/2019
by Ashwin Pananjady, et al.

Markov reward processes (MRPs) are used to model stochastic phenomena arising in operations research, control engineering, robotics, artificial intelligence, as well as communication and transportation networks. In many of these cases, such as in the policy evaluation problem encountered in reinforcement learning, the goal is to estimate the long-term value function of such a process without access to the underlying population transition and reward functions. Working with samples generated under the synchronous model, we study the problem of estimating the value function of an infinite-horizon, discounted MRP in the ℓ_∞-norm. We analyze both the standard plug-in approach to this problem and a more robust variant, and establish non-asymptotic bounds that depend on the (unknown) problem instance, as well as data-dependent bounds that can be evaluated based on the observed data. We show that these approaches are minimax-optimal up to constant factors over natural sub-classes of MRPs. Our analysis makes use of a leave-one-out decoupling argument tailored to the policy evaluation problem, one which may be of independent interest.
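To make the setting concrete, here is a minimal sketch of a plug-in estimate under the synchronous model, assuming "plug-in" carries its usual meaning: replace the unknown transition matrix and reward function by their empirical counterparts and solve the resulting Bellman equation V = r + γPV. The function names `sample_synchronous` and `plug_in_value`, the Gaussian reward noise, and the use of fixed-point iteration to solve the empirical equation are illustrative assumptions and are not taken from the paper.

```python
import random


def sample_synchronous(P, r_mean, n, noise=0.1, seed=0):
    """Synchronous (generative) sampling: from EVERY state, draw n next
    states and n noisy rewards. Gaussian reward noise is an assumption."""
    rng = random.Random(seed)
    states = list(range(len(P)))
    next_states = [rng.choices(states, weights=P[s], k=n) for s in states]
    rewards = [[r_mean[s] + rng.gauss(0.0, noise) for _ in range(n)]
               for s in states]
    return next_states, rewards


def plug_in_value(next_states, rewards, gamma, tol=1e-10, max_iter=100_000):
    """Plug-in estimate: build the empirical transition matrix and mean
    rewards, then solve V = r_hat + gamma * P_hat V by fixed-point
    iteration (a contraction in the sup norm because gamma < 1)."""
    k = len(next_states)
    # Empirical transition probabilities from next-state counts.
    P_hat = []
    for draws in next_states:
        counts = [0] * k
        for t in draws:
            counts[t] += 1
        P_hat.append([c / len(draws) for c in counts])
    # Empirical mean reward per state.
    r_hat = [sum(rs) / len(rs) for rs in rewards]

    V = [0.0] * k
    for _ in range(max_iter):
        V_new = [r_hat[s] + gamma * sum(P_hat[s][t] * V[t] for t in range(k))
                 for s in range(k)]
        # Stop once successive iterates agree in the sup norm.
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


def sup_norm_error(V_est, V_true):
    """The l_infinity error: the largest per-state deviation."""
    return max(abs(a - b) for a, b in zip(V_est, V_true))
```

For example, `plug_in_value(*sample_synchronous(P, r, n=2000), gamma=0.9)` returns one estimated value per state, and `sup_norm_error` measures its deviation from a known value function in the ℓ_∞-norm, the error metric studied in the abstract. The more robust variant mentioned above is not sketched here, since the abstract does not describe it.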

research
07/25/2023

The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation

Theoretical guarantees in reinforcement learning (RL) are known to suffe...
research
09/21/2018

Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

In reinforcement learning (RL), one of the key components is policy eva...
research
03/09/2020

Transfer Reinforcement Learning under Unobserved Contextual Information

In this paper, we study a transfer reinforcement learning problem where ...
research
09/24/2021

Optimal policy evaluation using kernel-based temporal difference methods

We study methods based on reproducing kernel Hilbert spaces for estimati...
research
11/07/2022

Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

We study non-parametric estimation of the value function of an infinite-...
research
06/27/2012

Statistical Linear Estimation with Penalized Estimators: an Application to Reinforcement Learning

Motivated by value function estimation in reinforcement learning, we stu...
research
06/01/2018

Learning convex bounds for linear quadratic control policy synthesis

Learning to make decisions from observed data in dynamic environments re...
