Approximate discounting-free policy evaluation from transient and recurrent states

04/08/2022
by Vektor Dewanto, et al.

In order to distinguish policies that prescribe good actions from those that prescribe bad actions in transient states, we need to evaluate the so-called bias of a policy from transient states. However, we observe that most (if not all) works in approximate discounting-free policy evaluation thus far estimate the bias solely from recurrent states. We therefore propose a system of approximators for the bias (specifically, its relative value) from both transient and recurrent states. Its key ingredient is a seminorm LSTD (least-squares temporal difference), whose minimizer we derive in an expression that can be approximated by sampling, as required in model-free reinforcement learning. This seminorm LSTD also facilitates the formulation of a general, unifying procedure for LSTD-based policy-value approximators. Experimental results validate the effectiveness of our proposed method.
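For context, the sketch below shows a standard discounting-free (average-reward) LSTD estimator of a policy's relative value with linear features; it is not the paper's seminorm variant, and the function name, signature, and regularization term are illustrative assumptions only.

```python
import numpy as np

def average_reward_lstd(transitions, phi, n_features, reg=1e-6):
    """Illustrative sketch of discounting-free (average-reward) LSTD.

    Estimates the gain (average reward) by a sample mean, then solves the
    least-squares fixed point A w = b for the relative-value weights, where
    the differential reward r_t - gain replaces the discounted reward.

    transitions: iterable of (s, r, s_next) samples gathered under the policy
    phi: feature map, state -> np.ndarray of shape (n_features,)
    """
    rewards = [r for (_, r, _) in transitions]
    gain = np.mean(rewards)                  # sample estimate of the average reward

    A = reg * np.eye(n_features)             # small ridge term to keep A invertible
    b = np.zeros(n_features)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - f_next)         # no discount factor in this setting
        b += f * (r - gain)                  # differential (bias-like) reward
    w = np.linalg.solve(A, b)
    return gain, w                           # relative value estimate: v(s) ~ phi(s) @ w
```

Under this kind of scheme, the relative values are only identified up to an additive constant on recurrent states; the paper's contribution is to extend LSTD-style estimation so that the bias is also approximated from transient states.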


