Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

11/07/2022
by Yaqi Duan, et al.

We study non-parametric estimation of the value function of an infinite-horizon γ-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including the canonical K-step look-ahead TD for K = 1, 2, … and the TD(λ) family for λ ∈ [0, 1) as special cases. Our bounds capture the dependence of the estimation error on the Bellman fluctuations, the mixing time of the Markov chain, any mis-specification in the model, and the choice of weight function defining the estimator itself, and they reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, the statistical error under trajectory data is similar to that under i.i.d. sample transition pairs; under mis-specification, however, temporal dependence in the data inflates the statistical error. Any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds with minimax lower bounds that establish the optimality of TD-based methods with appropriately chosen look-ahead and weighting, and that reveal some fundamental differences between value function estimation and ordinary non-parametric regression.
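To make the estimators concrete: the K-step look-ahead TD target sums K discounted rewards and then bootstraps from the current value estimate at the state K transitions ahead, and TD(λ) is the geometric mixture of these targets with weights (1 − λ)λ^(K−1) over K ≥ 1. The snippet below is a minimal tabular sketch of the K-step update run on a single trajectory from a toy ergodic MRP; the chain, rewards, step size, and trajectory length are all invented for illustration, and this is not the paper's kernel-based non-parametric estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, K = 5, 0.9, 3          # toy state count, discount, look-ahead
alpha, T = 0.05, 200_000                # step size and trajectory length

# Hypothetical ergodic MRP (invented for illustration): random stochastic
# transition matrix P and state-dependent mean rewards r.
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.uniform(0.0, 1.0, size=n_states)

# Ground-truth value function V = (I - gamma * P)^{-1} r, for comparison.
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# Simulate a single trajectory s_0, s_1, ... with noisy rewards.
states = np.empty(T + K, dtype=int)
states[0] = 0
for t in range(T + K - 1):
    states[t + 1] = rng.choice(n_states, p=P[states[t]])
rewards = r[states] + rng.normal(0.0, 0.1, size=T + K)

# K-step look-ahead TD: the target sums K discounted rewards, then
# bootstraps from the current estimate K transitions ahead.
V_hat = np.zeros(n_states)
discounts = gamma ** np.arange(K)
for t in range(T):
    s = states[t]
    target = discounts @ rewards[t:t + K] + gamma**K * V_hat[states[t + K]]
    V_hat[s] += alpha * (target - V_hat[s])

print("max abs estimation error:", np.max(np.abs(V_hat - V_true)))
```

Setting K = 1 recovers canonical one-step TD, while averaging the K-step targets with the geometric weights above yields the TD(λ) family discussed in the abstract.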

