Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

01/27/2019
by Kefan Dong, et al.

A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. (2018) proposed a Q-learning algorithm with a UCB exploration policy and proved that it achieves a nearly optimal regret bound for finite-horizon episodic MDPs. In this paper, we adapt Q-learning with a UCB exploration bonus to infinite-horizon MDPs with discounted rewards, without access to a generative model. We show that the sample complexity of exploration of our algorithm is bounded by Õ(SA / (ϵ^2 (1-γ)^7)). This improves the previous best known result of Õ(SA / (ϵ^4 (1-γ)^8)) in this setting, achieved by delayed Q-learning (Strehl et al., 2006), and matches the lower bound in terms of ϵ, S, and A up to logarithmic factors.
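The algorithm described above combines tabular Q-learning with an optimism-based (UCB) exploration bonus that shrinks as the visit count of each state-action pair grows. The following is a minimal sketch of that idea for the infinite-horizon discounted setting; the environment interface (`reset()`/`step()`), the bonus constant, and the learning-rate schedule are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def q_learning_ucb(env, num_states, num_actions, gamma=0.99, delta=0.05, num_steps=100_000):
    """Tabular Q-learning with a UCB-style exploration bonus for an
    infinite-horizon discounted MDP.

    Illustrative sketch, not the paper's exact algorithm: the bonus constant
    and learning-rate schedule are assumptions. Rewards are assumed to lie
    in [0, 1], so the value scale is Vmax = 1 / (1 - gamma).
    """
    H = 1.0 / (1.0 - gamma)                       # effective horizon
    Q = np.full((num_states, num_actions), H)     # optimistic initialization at Vmax
    N = np.zeros((num_states, num_actions), int)  # visit counts per (s, a)

    s = env.reset()                               # assumed Gym-like interface
    for _ in range(num_steps):
        a = int(np.argmax(Q[s]))                  # act greedily w.r.t. the optimistic Q
        s_next, r = env.step(a)                   # assumed: step(a) -> (next_state, reward)

        N[s, a] += 1
        k = N[s, a]
        alpha = (H + 1.0) / (H + k)               # learning rate in the spirit of Jin et al. (2018)
        bonus = np.sqrt(H * np.log(num_states * num_actions * k / delta) / k)  # UCB bonus (illustrative constant)

        target = r + bonus + gamma * np.max(Q[s_next])
        # Keep the estimate optimistic but never let it increase past its previous value.
        Q[s, a] = min(Q[s, a], (1.0 - alpha) * Q[s, a] + alpha * target)
        s = s_next
    return Q
```

In this sketch the effective horizon H = 1/(1-γ) plays the role that the episode length H plays in the finite-horizon analysis, which is why the sample-complexity bound depends on powers of 1/(1-γ).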
