Minimax Optimal Reinforcement Learning for Discounted MDPs

10/01/2020 ∙ by Jiafan He, et al.

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) in the tabular setting. We propose a model-based algorithm named UCBVI-γ, which combines the optimism-in-the-face-of-uncertainty principle with a Bernstein-type bonus. It achieves Õ(√(SAT)/(1-γ)^1.5) regret, where S is the number of states, A is the number of actions, γ is the discount factor and T is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least Ω̃(√(SAT)/(1-γ)^1.5). Our upper bound matches this minimax lower bound up to logarithmic factors, which suggests that UCBVI-γ is nearly minimax optimal for discounted MDPs.
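To get a feel for how the stated rate scales, the short sketch below evaluates only the leading term √(SAT)/(1-γ)^1.5. It is an illustration, not part of the paper: the function name `regret_rate` and the example values of S, A, T and γ are assumptions made here, and the constants and logarithmic factors hidden by the Õ and Ω̃ notation are dropped, so only ratios between settings are meaningful.

```python
import math


def regret_rate(S: int, A: int, T: int, gamma: float) -> float:
    """Leading term sqrt(S*A*T) / (1 - gamma)^1.5 of the regret bound.

    Hypothetical helper for illustration: constants and logarithmic
    factors are omitted, so the value is a scale, not an actual regret.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return math.sqrt(S * A * T) / (1.0 - gamma) ** 1.5


# Example values chosen here for illustration only.
base = regret_rate(S=10, A=5, T=100_000, gamma=0.9)
harder = regret_rate(S=10, A=5, T=100_000, gamma=0.99)

# Shrinking (1 - gamma) by a factor of 10 scales the term by 10^1.5.
print(harder / base)

# Quadrupling the number of steps T doubles the term (square-root growth).
print(regret_rate(10, 5, 400_000, 0.9) / base)
```

The two ratios show the dependencies named in the abstract: the term grows with the square root of T, S and A, and with the 1.5 power of the effective horizon 1/(1-γ).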


