Model-free policy evaluation in Reinforcement Learning via upper solutions

05/05/2021
by D. Belomestny, et al.

In this work we present an approach for building tight model-free confidence intervals for the optimal value function V^⋆ in general infinite-horizon MDPs via upper solutions. We propose a novel upper value iterative procedure (UVIP) that constructs upper solutions for a given agent's policy, yielding a model-free method of policy evaluation. We analyze the convergence properties of approximate UVIP under rather general assumptions and illustrate its performance on a number of benchmark RL problems.
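The abstract does not spell out the recursion, so the following is only a minimal sketch of the general idea behind an upper value iteration: a Bellman-type backup of the form E[max_a ...] with a zero-mean martingale correction built from the policy value V^π, whose fixed point dominates V^⋆. Everything here is an assumption for illustration: the tabular setting, the simulator `step(x, a)`, the names `uvip_sketch`, `V_pi`, and the specific empirical estimators; the exact UVIP recursion, its estimators, and the convergence conditions are given in the paper.

```python
import numpy as np

def uvip_sketch(step, n_states, n_actions, V_pi, gamma=0.9,
                n_samples=64, n_iters=100):
    """Illustrative upper value iteration with a martingale correction.

    Empirical fixed-point iteration (one possible form, assumed here):
      V_up(x) <- (1/N) sum_n max_a [ r_n(x,a)
                    + gamma * ( V_up(y_n(x,a))
                                - (V_pi(y_n(x,a)) - mean_m V_pi(y_m(x,a))) ) ]
    The subtracted term is an empirical zero-mean (martingale) increment
    built from the policy value V_pi; since E[max(...)] >= max(E[...]) and
    the increment has (approximately) zero conditional mean, the fixed
    point is an upper estimate of the optimal value V*.

    `step(x, a)` is an assumed model-free simulator returning a sampled
    (next_state, reward); `V_pi` is a state-indexed array with the value
    of the agent's policy, estimated beforehand (e.g. by Monte Carlo).
    """
    V_pi = np.asarray(V_pi, dtype=float)
    V_up = V_pi.copy()  # warm start from the policy value (a lower estimate)
    for _ in range(n_iters):
        V_new = np.empty(n_states)
        for x in range(n_states):
            # joint samples: for each sample index, one draw per action
            ys = np.empty((n_samples, n_actions), dtype=int)
            rs = np.empty((n_samples, n_actions))
            for a in range(n_actions):
                for n in range(n_samples):
                    ys[n, a], rs[n, a] = step(x, a)
            # empirical martingale increment: V_pi at the sampled next
            # state minus its per-action sample mean
            m = V_pi[ys] - V_pi[ys].mean(axis=0, keepdims=True)
            inner = rs + gamma * (V_up[ys] - m)   # shape (N, A)
            V_new[x] = inner.max(axis=1).mean()   # empirical E[max_a ...]
        V_up = V_new
    return V_up
```

Under these assumptions, `[V_pi[x], V_up[x]]` plays the role of the model-free bracket around V^⋆(x) described in the abstract, and the width of the gap indicates how far the agent's policy is from optimal.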
