Loop estimator for discounted values in Markov reward processes

02/15/2020
by Falcon Z. Dai, et al.

Policy evaluation is the core step of the policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning: it estimates the value of a state from samples of the Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator, called the loop estimator, that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method has a space complexity of O(1) when estimating the value of a single positive recurrent state s, unlike TD (with O(S)) or model-based methods (with O(S^2)). Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of O(√(τ_s/T)) over T steps on a single sample path, where τ_s is the maximal expected hitting time to state s. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.
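
To make the regenerative idea concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes the estimator can be written as a ratio obtained from the identity v(s) = E[G] + E[γ^τ] v(s), where G is the discounted reward collected over one loop (from a visit to s until the next return to s) and τ is the loop length. The function names (`loop_estimate`, `simulate`, `exact_values`), the reward-per-state convention, and the toy three-state chain are all hypothetical choices made for illustration.

```python
import random


def loop_estimate(path, rewards, s, gamma):
    """Estimate the discounted value of state s from a single sample path.

    Assumed form (not quoted from the paper): v(s) ~ mean(G) / (1 - mean(gamma**tau)),
    averaged over completed loops s -> ... -> s. Only a few running sums are
    stored, so the memory used does not grow with the number of states.
    """
    n_loops = 0
    sum_return = 0.0    # sum over loops of the discounted reward inside the loop
    sum_discount = 0.0  # sum over loops of gamma ** (loop length)
    in_loop = False
    g, disc = 0.0, 1.0
    for state, r in zip(path, rewards):
        if state == s:
            if in_loop:  # a loop has just been completed
                n_loops += 1
                sum_return += g
                sum_discount += disc
            in_loop = True
            g, disc = 0.0, 1.0  # start a fresh loop at s
        if in_loop:
            g += disc * r
            disc *= gamma
    if n_loops == 0:
        raise ValueError("state s was not revisited on this path")
    return (sum_return / n_loops) / (1.0 - sum_discount / n_loops)


def simulate(P, r, start, steps, rng):
    """Sample a path from a Markov reward process with reward r[state] per step."""
    path, rewards = [], []
    state = start
    for _ in range(steps):
        path.append(state)
        rewards.append(r[state])
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return path, rewards


def exact_values(P, r, gamma, iters=2000):
    """Reference values from repeatedly applying v <- r + gamma * P v."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


if __name__ == "__main__":
    # Toy chain chosen only for illustration.
    P = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.6, 0.0, 0.4]]
    r = [1.0, 0.0, 2.0]
    gamma = 0.9
    path, rewards = simulate(P, r, start=0, steps=50_000, rng=random.Random(0))
    print("loop estimate:", loop_estimate(path, rewards, s=0, gamma=gamma))
    print("reference    :", exact_values(P, r, gamma)[0])
```

The sketch keeps only three running quantities per target state, which is how the O(1) space claim for a single state can be read; the reference computation uses the full transition matrix and is included only for comparison.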

research
08/28/2023

On Reward Structures of Markov Decision Processes

A Markov decision process can be parameterized by a transition kernel an...
research
02/22/2023

Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes

The classical algorithms used in tabular reinforcement learning (Value I...
research
04/09/2019

Practical Open-Loop Optimistic Planning

We consider the problem of online planning in a Markov Decision Process ...
research
12/12/2012

Polynomial Value Iteration Algorithms for Deterministic MDPs

Value iteration is a commonly used and empirically competitive method in...
research
01/31/2018

An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path

In this paper, we consider a modified version of the control problem in ...
research
10/04/2022

Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

We consider the task of estimating a structural model of dynamic decisio...
research
01/20/2022

Two-Sample Testing in Reinforcement Learning

Value-based reinforcement-learning algorithms have shown strong performa...
