Adaptive Reward-Free Exploration

06/11/2020 · by Emilie Kaufmann, et al.

Reward-free exploration is a reinforcement learning setting recently studied by Jin et al., who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach to reward-free exploration, which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm (RF-UCRL) can be seen as a variant of an algorithm proposed by Fiechter in 1994 for a different objective that we call best-policy identification. We prove that RF-UCRL needs O((SAH^4/ε^2)ln(1/δ)) episodes to output, with probability 1-δ, an ε-approximation of the optimal policy for any reward function. We empirically compare it to oracle strategies that use a generative model.
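
To make the high-level idea concrete, here is a minimal sketch of a reward-free UCRL-style exploration loop: the agent maintains empirical transition counts, propagates an upper bound on the estimation error backwards over the horizon, acts greedily with respect to that bound, and stops once it falls below ε. The tabular environment interface (`env.reset`, `env.step`), the exact bonus form, and the stopping threshold are illustrative assumptions, not the paper's precise quantities.

```python
# Illustrative sketch of a reward-free UCRL-style loop (not the authors' exact algorithm).
# Assumed interface: env.reset() -> state index, env.step(a) -> next state index.
import numpy as np

def reward_free_ucrl(env, S, A, H, epsilon, delta, max_episodes=10_000):
    """Explore without rewards until an upper bound on the value estimation
    error (for any reward function) drops below epsilon."""
    counts = np.zeros((S, A), dtype=int)      # visit counts n(s, a)
    trans = np.zeros((S, A, S), dtype=int)    # transition counts n(s, a, s')
    for episode in range(max_episodes):
        # W[h, s, a]: propagated error upper bound, largest for rarely
        # visited (s, a) pairs; plays the role of an optimistic "value".
        W = np.zeros((H + 1, S, A))
        p_hat = trans / np.maximum(counts[..., None], 1)
        bonus = np.sqrt(np.log(S * A * H / delta) / np.maximum(counts, 1))
        for h in range(H - 1, -1, -1):
            next_err = W[h + 1].max(axis=1)              # max over actions
            W[h] = np.minimum(H, bonus + p_hat @ next_err)
        if W[0].max() <= epsilon / 2:                    # simplified stopping rule
            return counts, trans, episode
        # Follow the policy that is greedy w.r.t. the error upper bound.
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(W[h, s]))
            s_next = env.step(a)
            counts[s, a] += 1
            trans[s, a, s_next] += 1
            s = s_next
    return counts, trans, max_episodes
```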
