Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
In this paper we consider the problem of learning an ϵ-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with S states, A actions, discount factor γ ∈ (0,1), and an approximation threshold ϵ > 0, we provide a model-free algorithm to learn an ϵ-optimal policy with sample complexity Õ(SA ln(1/p) / (ϵ^2 (1-γ)^5.5)) and success probability 1-p, where the notation Õ(·) hides poly-logarithmic factors of S, A, 1/(1-γ), and 1/ϵ. For small enough ϵ, we show an improved algorithm with sample complexity Õ(SA ln(1/p) / (ϵ^2 (1-γ)^3)). While the first bound improves upon all known model-free algorithms and all model-based ones with tight dependence on S, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
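Written out in display form (notation exactly as in the abstract, with no new quantities introduced), the two sample complexity bounds read:

$$
\tilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)
\qquad\text{and, for small enough } \epsilon, \qquad
\tilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right),
$$

the second of which matches the information-theoretic lower bound up to logarithmic factors.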