Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

09/28/2020
by   Zihan Zhang, et al.

Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to its long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions pose (at most) little additional difficulty in terms of sample complexity. We consider episodic reinforcement learning with S states, A actions, planning horizon H, total reward bounded by 1, where the agent plays for K episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. The new bonus only requires tweaking the constants to ensure optimism and is thus significantly simpler than existing bonus constructions. We show that MVP enjoys an O((√(SAK) + S^2A) polylog(SAHK)) regret, approaching the Ω(√(SAK)) lower bound of contextual bandits. Notably, this result 1) exponentially improves on the state-of-the-art polynomial-time algorithms by Dann et al. [2019], Zanette et al. [2019], and Zhang et al. [2020] in terms of the dependency on H, and 2) exponentially improves on the running time of Wang et al. [2020] while significantly improving the dependency on S, A, and K in sample complexity.
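To make the abstract's "Bernstein-type bonus" concrete, here is a generic sketch of how such a bonus is typically computed in tabular RL. This is an illustration of the general technique, not the paper's exact construction; the function name, the constants c1 and c2, and the empirical variance estimate are all illustrative assumptions.

```python
import numpy as np

def bernstein_bonus(next_values, count, log_term, c1=1.0, c2=1.0):
    """Generic Bernstein-style exploration bonus for one (s, a) pair.

    next_values : array of next-state value samples observed from (s, a)
    count       : visit count n(s, a)
    log_term    : confidence log factor, e.g. log(SAHK / delta)
    c1, c2      : constants; in MVP-style analyses only such constants
                  are tuned to guarantee optimism (values here are
                  illustrative, not from the paper)
    """
    var = np.var(next_values)  # empirical variance of the next-state value
    # Variance-dependent term shrinks like 1/sqrt(n);
    # the lower-order correction term shrinks like 1/n.
    return c1 * np.sqrt(var * log_term / count) + c2 * log_term / count

# Toy usage: 10 observed next-state values for a single (s, a)
vals = np.array([0.1, 0.2, 0.15, 0.3, 0.1, 0.25, 0.2, 0.1, 0.3, 0.2])
bonus = bernstein_bonus(vals, count=len(vals), log_term=np.log(100))
```

The key point the abstract highlights is that the bonus depends on the empirical variance of the value at the next state, which is what lets the regret analysis avoid paying extra factors of the horizon H.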


