Q-learning with Logarithmic Regret

06/16/2020
by Kunhe Yang, et al.

This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning when the optimal Q-function has a strictly positive sub-optimality gap. We prove that the optimistic Q-learning algorithm studied in [Jin et al. 2018] enjoys an O(SA·poly(H)/gap_min · log(SAT)) cumulative regret bound, where S is the number of states, A is the number of actions, H is the planning horizon, T is the total number of steps, and gap_min is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of S, A, and T up to a log(SA) factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
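
To make the quantity gap_min concrete, here is a minimal sketch that computes it for a toy episodic MDP. It assumes the usual definition, gap_h(s,a) = V*_h(s) - Q*_h(s,a), with gap_min taken as the smallest strictly positive gap over all steps, states, and actions. The MDP, the function names optimal_q and min_gap, and the tolerance tol are hypothetical and chosen for illustration; they do not come from the paper.

```python
def optimal_q(P, R, H):
    """Backward induction for an episodic tabular MDP.

    P[s][a] is a list of next-state probabilities and R[s][a] a reward.
    Returns Q[h][s][a] for h = 0..H-1 and V[h][s] for h = 0..H (V[H] = 0).
    """
    S, A = len(P), len(P[0])
    V = [[0.0] * S for _ in range(H + 1)]
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                expected_next = sum(P[s][a][t] * V[h + 1][t] for t in range(S))
                Q[h][s][a] = R[s][a] + expected_next
            V[h][s] = max(Q[h][s])
    return Q, V


def min_gap(Q, V, tol=1e-12):
    """Smallest strictly positive gap V[h][s] - Q[h][s][a]; None if no gap is positive."""
    gaps = [
        V[h][s] - q
        for h in range(len(Q))
        for s in range(len(Q[h]))
        for q in Q[h][s]
        if V[h][s] - q > tol  # zero gaps (optimal actions, ties) are excluded
    ]
    return min(gaps) if gaps else None


# Hypothetical toy MDP: 2 states, 2 actions, horizon H = 2, deterministic transitions.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: both actions stay in state 1
]
R = [
    [0.5, 0.0],   # rewards in state 0
    [1.0, 0.25],  # rewards in state 1
]

Q, V = optimal_q(P, R, H=2)
gap_min = min_gap(Q, V)
print(gap_min)
```

In this toy instance both actions are optimal in state 0 at the first step, so their gaps are zero and do not enter the minimum. Because the regret bound scales with 1/gap_min, a smaller positive gap gives a weaker guarantee.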

research
11/23/2020

Logarithmic Regret for Reinforcement Learning with Linear Function Approximation

Reinforcement learning (RL) with linear function approximation has recei...
research
07/01/2021

Gap-Dependent Bounds for Two-Player Markov Games

As one of the most popular methods in the field of reinforcement learnin...
research
03/16/2023

On the Interplay Between Misspecification and Sub-optimality Gap in Linear Contextual Bandits

We study linear contextual bandits in the misspecified setting, where th...
research
02/09/2021

Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

This paper presents a new model-free algorithm for episodic finite-horiz...
research
11/03/2021

Effective guessing has unlikely consequences

A classic result of Paul, Pippenger, Szemerédi and Trotter states that D...
research
06/14/2021

Online Sub-Sampling for Reinforcement Learning with General Function Approximation

Designing provably efficient algorithms with general function approximat...
research
03/03/2022

The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches

In this paper, we study the problem of regret minimization for episodic ...
