Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

06/23/2020 ∙ by Dongruo Zhou, et al.

Modern tasks in reinforcement learning often involve large state and action spaces. To handle them efficiently, one often uses a predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning with feature mapping for discounted Markov Decision Processes (MDPs). We propose a novel algorithm which makes use of the feature mapping and achieves an Õ(d√(T)/(1-γ)^2) regret, where d is the dimension of the feature space, T is the time horizon and γ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound that requires neither access to a generative model nor strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithm, the regret is lower bounded by Ω(d√(T)/(1-γ)^1.5). Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a (1-γ)^-0.5 factor.
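The size of the gap between the two bounds follows from dividing one by the other. The short calculation below is only a sketch: it ignores the logarithmic factors and constants hidden by Õ and Ω, and the value γ = 0.99 is an illustrative assumption, not a setting taken from the paper.

```latex
\frac{d\sqrt{T}/(1-\gamma)^{2}}{d\sqrt{T}/(1-\gamma)^{1.5}}
  = (1-\gamma)^{1.5-2}
  = (1-\gamma)^{-0.5},
\qquad
\text{e.g. } \gamma = 0.99 \;\Rightarrow\; (1-\gamma)^{-0.5} = (0.01)^{-0.5} = 10 .
```

The dependence on d and T cancels, so the two bounds differ only in their dependence on the discount factor.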


