Optimal policy evaluation using kernel-based temporal difference methods

09/24/2021
by Yaqi Duan et al.

We study methods based on reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). In particular, we consider a regularized form of the kernel least-squares temporal difference (LSTD) estimate; in the population limit of infinite data, it corresponds to the fixed point of a projected Bellman operator defined by the associated reproducing kernel Hilbert space. The estimator itself is obtained by computing the projected fixed point induced by a regularized version of the empirical operator; due to the underlying kernel structure, this reduces to solving a linear system involving kernel matrices. We analyze the error of this estimate in the L^2(μ)-norm, where μ denotes the stationary distribution of the underlying Markov chain. Our analysis imposes no assumptions on the transition operator of the Markov chain, but only conditions on the reward function and the population-level kernel LSTD solutions. Using techniques from empirical process theory, we derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which show that our rate is optimal in terms of the sample size n and the effective horizon H = (1 - γ)^-1. Whereas existing worst-case theory predicts cubic scaling (H^3) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.
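To make the reduction to a linear system concrete, below is a minimal sketch of a regularized kernel LSTD solver of the kind described above. It assumes i.i.d. transition samples (s_i, r_i, s'_i), a Gaussian kernel, and a simple ridge-style regularizer; the helper names (gaussian_kernel, kernel_lstd), the kernel choice, and the exact placement of the regularization term are illustrative assumptions rather than the precise estimator analyzed in the paper. The point it illustrates is that, by a representer-type argument, the projected fixed point can be computed by solving an n x n system in the kernel matrices K[i, j] = k(s_i, s_j) and K'[i, j] = k(s'_i, s_j).

```python
import numpy as np


def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix with entries k(x_i, y_j)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def kernel_lstd(states, rewards, next_states, gamma, reg, bandwidth=1.0):
    """Regularized kernel LSTD from n transitions (s_i, r_i, s'_i).

    Writes the value estimate as V(s) = sum_j alpha_j k(s_j, s) and solves
    the n x n system (K - gamma * K_next + n * reg * I) alpha = r, where
    K[i, j] = k(s_i, s_j) and K_next[i, j] = k(s'_i, s_j).  This is one
    standard ridge-regularized form of the kernel TD fixed point; the
    paper's estimator may regularize differently.
    """
    n = len(rewards)
    K = gaussian_kernel(states, states, bandwidth)
    K_next = gaussian_kernel(next_states, states, bandwidth)
    alpha = np.linalg.solve(K - gamma * K_next + n * reg * np.eye(n), rewards)

    def value_function(query_states):
        # Evaluate V at new states via kernel evaluations against the data.
        return gaussian_kernel(np.atleast_2d(query_states), states, bandwidth) @ alpha

    return value_function


if __name__ == "__main__":
    # Toy MRP on [0, 1]: the state drifts with Gaussian noise, reward r(s) = s.
    rng = np.random.default_rng(0)
    s = rng.uniform(0.0, 1.0, size=(500, 1))
    s_next = np.clip(s + 0.1 * rng.standard_normal(s.shape), 0.0, 1.0)
    r = s.ravel()
    V_hat = kernel_lstd(s, r, s_next, gamma=0.9, reg=1e-3)
    print(V_hat(np.array([[0.2], [0.8]])))  # value estimates at two query states
```

The regularizer reg plays the role of the ridge penalty whose population analogue appears in the paper's analysis; in practice it would be chosen by cross-validation, and the bandwidth controls the effective dimension of the RKHS that the eigenvalue-dependent bounds refer to.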

Related research

01/21/2022 - Optimal variance-reduced stochastic approximation in Banach spaces
We study the problem of estimating the fixed point of a contractive oper...

11/07/2022 - Policy evaluation from a single path: Multi-step methods, mixing and mis-specification
We study non-parametric estimation of the value function of an infinite-...

09/19/2019 - Value function estimation in Markov reward processes: Instance-dependent ℓ_∞-bounds for policy evaluation
Markov reward processes (MRPs) are used to model stochastic phenomena ar...

06/11/2019 - Variance-reduced Q-learning is minimax optimal
We introduce and analyze a form of variance-reduced Q-learning. For γ-di...

12/09/2020 - Optimal oracle inequalities for solving projected fixed-point equations
Linear fixed point equations in Hilbert spaces arise in a variety of set...

01/22/2013 - Properties of the Least Squares Temporal Difference learning algorithm
This paper presents four different ways of looking at the well-known Lea...

10/07/2022 - Estimation of the Order of Non-Parametric Hidden Markov Models using the Singular Values of an Integral Operator
We are interested in assessing the order of a finite-state Hidden Markov...
