Model-free policy evaluation in Reinforcement Learning via upper solutions
In this work we present an approach for building tight model-free confidence intervals for the optimal value function V^⋆ in general infinite-horizon MDPs via upper solutions. We propose a novel upper value iterative procedure (UVIP) that constructs upper solutions for a given agent's policy, leading to a model-free method of policy evaluation. We analyze the convergence properties of approximate UVIP under rather general assumptions and illustrate its performance on a number of benchmark RL problems.
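To make the notion of an upper solution concrete, below is a minimal, model-based sketch in Python. It relies only on the classical fact that any function V̄ satisfying the Bellman inequality V̄ ≥ T V̄ (with T the Bellman optimality operator) dominates V^⋆ pointwise, so iterating T from an optimistic start keeps every iterate an upper bound on V^⋆. This is an illustration of the general idea under that textbook assumption, not the model-free, sample-based UVIP construction developed in the paper; the MDP data below are hypothetical.

```python
import numpy as np

def bellman_optimality(V, P, R, gamma):
    """Bellman optimality operator: (T V)(x) = max_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V(x') ].

    P has shape (n_actions, n_states, n_states); R has shape (n_states, n_actions)."""
    return np.max(R + gamma * np.einsum("axy,y->xa", P, V), axis=1)

def upper_value_iteration(P, R, gamma, n_iter=200):
    """Iterate T from the optimistic start V_0 = r_max / (1 - gamma).

    Since V_0 >= T V_0 and T is monotone, every iterate satisfies
    V_k >= T V_k >= V^*, so each V_k is an upper solution and a
    pointwise upper bound on the optimal value function."""
    V = np.full(R.shape[0], R.max() / (1.0 - gamma))
    for _ in range(n_iter):
        V = bellman_optimality(V, P, R, gamma)
    return V

# Toy 2-state, 2-action MDP (hypothetical data, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, x, x']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[x, a]

V_upper = upper_value_iteration(P, R, gamma=0.95)
print(V_upper)  # pointwise upper bound on V^*
```

Pairing such an upper bound on V^⋆ with a lower estimate of the given policy's value is what yields a two-sided confidence interval; the paper's contribution is obtaining the upper part without access to the transition model.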