Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

02/09/2021
by Haike Xu, et al.

This paper presents Adaptive Multi-step Bootstrap (AMB), a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs) that enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal Q-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among admissible actions, that is, actions that are not dominated by any other action. We show that when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that scales only with the sum of the inverses of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed that all upper-confidence-bound (UCB) algorithms suffer an additional Ω(S/Δ_min) regret due to over-exploration, where Δ_min is the minimum sub-optimality gap and S is the number of states. We further show that for general MDPs, AMB suffers an additional |Z_mul|/Δ_min regret, where Z_mul is the set of state-action pairs (s,a) such that a is a non-unique optimal action for s. We complement our upper bound with a lower bound showing that the dependency on |Z_mul|/Δ_min is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.
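
The action-selection rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `select_action`, the array inputs `q_upper` and `q_lower` (upper and lower confidence bounds on the optimal Q-values at one state), and the reading of "dominated" as "another action's lower bound exceeds this action's upper bound" are all assumptions made for illustration.

```python
import numpy as np


def select_action(q_upper, q_lower):
    """Hypothetical helper: widest-interval action among non-dominated actions.

    q_upper, q_lower: per-action upper and lower confidence bounds at one
    state (assumed inputs, with q_lower[a] <= q_upper[a] for every action a).
    """
    q_upper = np.asarray(q_upper, dtype=float)
    q_lower = np.asarray(q_lower, dtype=float)

    # Assumption: an action is dominated when some other action's lower
    # bound is strictly larger than its upper bound.
    best_lower = q_lower.max()
    admissible = np.flatnonzero(q_upper >= best_lower)

    # Among admissible actions, pick the one with the longest interval.
    widths = q_upper[admissible] - q_lower[admissible]
    return int(admissible[np.argmax(widths)])


# Action 2 is dominated (its upper bound 0.3 is below action 0's lower
# bound 0.5); among actions 0 and 1, action 1 has the wider interval.
chosen = select_action([1.0, 0.9, 0.3], [0.5, 0.2, 0.0])
```

Unlike a plain UCB rule, which would pick the action with the largest upper bound (action 0 in this example), this rule spends samples on the admissible action whose value is least certain.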

research
06/24/2021

A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs

We derive a novel asymptotic problem-dependent lower-bound for regret mi...
research
10/20/2022

Horizon-Free Reinforcement Learning for Latent Markov Decision Processes

We study regret minimization for reinforcement learning (RL) in Latent M...
research
06/16/2020

Q-learning with Logarithmic Regret

This paper presents the first non-asymptotic result showing that a model...
research
06/10/2020

Planning in Markov Decision Processes with Gap-Dependent Sample Complexity

We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algo...
research
02/26/2018

Variance Reduction Methods for Sublinear Reinforcement Learning

This work considers the problem of provably optimal reinforcement learni...
research
03/01/2021

UCB Momentum Q-learning: Correcting the bias without forgetting

We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algo...
research
05/16/2022

From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

We propose the Bayes-UCBVI algorithm for reinforcement learning in tabul...
