
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

06/06/2020
by Zihan Zhang, et al.

In this paper we consider the problem of learning an ϵ-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with S states, A actions, discount factor γ ∈ (0,1), and approximation threshold ϵ > 0, we provide a model-free algorithm that, with success probability 1 − p, learns an ϵ-optimal policy with sample complexity Õ(SA ln(1/p) / (ϵ^2 (1−γ)^5.5)), where the notation Õ(·) hides poly-logarithmic factors of S, A, 1/(1−γ), and 1/ϵ. For small enough ϵ, we give an improved algorithm with sample complexity Õ(SA ln(1/p) / (ϵ^2 (1−γ)^3)). The first bound improves upon all known model-free algorithms, as well as model-based ones with tight dependence on S; the second beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
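To see how much the second bound gains over the first, the two leading-order expressions can be evaluated side by side. The sketch below is only an illustration, assuming that the constants and the poly-logarithmic factors hidden by Õ(·) can be ignored; the function name `leading_sample_complexity` and the example values of S, A, ϵ, γ, and p are hypothetical choices, not taken from the paper.

```python
import math


def leading_sample_complexity(S, A, eps, gamma, p, horizon_exponent):
    """Leading-order term S*A*ln(1/p) / (eps^2 * (1 - gamma)^k).

    Assumption: constants and the poly-log factors hidden by O-tilde are
    dropped, so the value is only meaningful for comparing how bounds scale.
    """
    if not (0 < gamma < 1):
        raise ValueError("gamma must lie in (0, 1)")
    if eps <= 0 or not (0 < p < 1):
        raise ValueError("need eps > 0 and p in (0, 1)")
    return S * A * math.log(1 / p) / (eps ** 2 * (1 - gamma) ** horizon_exponent)


# Hypothetical example problem size and accuracy targets.
S, A, eps, gamma, p = 100, 10, 0.1, 0.99, 0.05

first = leading_sample_complexity(S, A, eps, gamma, p, 5.5)   # first algorithm
second = leading_sample_complexity(S, A, eps, gamma, p, 3.0)  # improved algorithm

print(f"first bound  ~ {first:.3e}")
print(f"second bound ~ {second:.3e}")
# Both share S*A*ln(1/p)/eps^2, so the ratio reduces to (1 - gamma)^(-2.5).
print(f"ratio        ~ {first / second:.1f}")
```

Because both bounds share the factor SA ln(1/p)/ϵ^2, their ratio is (1−γ)^(−2.5) regardless of S, A, ϵ, and p; for γ = 0.99 this is 10^5, which is why the gap matters most for long effective horizons.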


03/07/2021

A Lower Bound for the Sample Complexity of Inverse Reinforcement Learning

Inverse reinforcement learning (IRL) is the task of finding a reward fun...
01/27/2019

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

A fundamental question in reinforcement learning is whether model-free a...
02/21/2017

Sample Efficient Policy Search for Optimal Stopping Domains

Optimal stopping problems consider the question of deciding when to stop...
01/31/2018

An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path

In this paper, we consider a modified version of the control problem in ...
02/12/2021

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Q-learning, which seeks to learn the optimal Q-function of a Markov deci...
02/24/2020

Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

It has been a trend in the Reinforcement Learning literature to derive s...
10/09/2021

Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

Achieving sample efficiency in online episodic reinforcement learning (R...