
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

by Zihan Zhang, et al.

In this paper we consider the problem of learning an ϵ-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with S states, A actions, a discount factor γ ∈ (0,1), and an approximation threshold ϵ > 0, we provide a model-free algorithm to learn an ϵ-optimal policy with sample complexity Õ(SA ln(1/p) / (ϵ^2 (1-γ)^5.5)) and success probability (1-p), where the notation Õ(·) hides poly-logarithmic factors of S, A, 1/(1-γ), and 1/ϵ. For small enough ϵ, we show an improved algorithm with sample complexity Õ(SA ln(1/p) / (ϵ^2 (1-γ)^3)). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on S, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
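To get a feel for the gap between the two bounds, the following minimal sketch evaluates only their leading terms. It is an illustration, not part of the paper: the function names and the example values of S, A, ϵ, γ, and p are hypothetical, and the poly-logarithmic factors and constants hidden by Õ(·) are dropped, so the outputs are not actual sample counts. The second bound is stated only for small enough ϵ, a condition this sketch does not check.

```python
import math


def leading_term_first(S, A, eps, gamma, p):
    """Leading term of the first bound: S*A*ln(1/p) / (eps^2 * (1-gamma)^5.5).

    Hidden poly-log factors and constants are ignored (assumption).
    """
    return S * A * math.log(1.0 / p) / (eps ** 2 * (1.0 - gamma) ** 5.5)


def leading_term_second(S, A, eps, gamma, p):
    """Leading term of the improved bound: S*A*ln(1/p) / (eps^2 * (1-gamma)^3).

    The abstract states this bound only for small enough eps; not checked here.
    """
    return S * A * math.log(1.0 / p) / (eps ** 2 * (1.0 - gamma) ** 3)


# Hypothetical example values, chosen only for illustration.
S, A, eps, gamma, p = 100, 10, 0.1, 0.9, 0.05

first = leading_term_first(S, A, eps, gamma, p)
second = leading_term_second(S, A, eps, gamma, p)

# The ratio depends only on the discount factor: (1 - gamma)^(-2.5).
ratio = first / second
print(f"first bound leading term:  {first:.3e}")
print(f"second bound leading term: {second:.3e}")
print(f"ratio first/second:        {ratio:.1f}")
```

Because both expressions share the factor SA ln(1/p)/ϵ^2, the ratio of the leading terms reduces to (1-γ)^(-2.5), so the improvement grows as γ approaches 1.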




A Lower Bound for the Sample Complexity of Inverse Reinforcement Learning

Inverse reinforcement learning (IRL) is the task of finding a reward fun...

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

A fundamental question in reinforcement learning is whether model-free a...

Sample Efficient Policy Search for Optimal Stopping Domains

Optimal stopping problems consider the question of deciding when to stop...

An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path

In this paper, we consider a modified version of the control problem in ...

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Q-learning, which seeks to learn the optimal Q-function of a Markov deci...

Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

It has been a trend in the Reinforcement Learning literature to derive s...

Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

Achieving sample efficiency in online episodic reinforcement learning (R...