Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

03/24/2022
by   Zihan Zhang, et al.
0

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound independent on the planning horizon. Specifically, we consider tabular MDP with S states, A actions, a planning horizon H, total reward bounded by 1, and the agent plays for K episodes. We design an algorithm that achieves an O(poly(S,A,log K)√(K)) regret in contrast to existing bounds which either has an additional polylog(H) dependency <cit.> or has an exponential dependency on S <cit.>. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/24/2020

Efficient Learning in Non-Stationary Linear Markov Decision Processes

We study episodic reinforcement learning in non-stationary linear (a.k.a...
research
01/29/2021

Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP

We show how to construct variance-aware confidence sets for linear bandi...
research
01/16/2013

The Complexity of Decentralized Control of Markov Decision Processes

Planning for distributed agents with partial state information is consid...
research
09/28/2020

Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Episodic reinforcement learning and contextual bandits are two widely st...
research
03/25/2021

Nearly Horizon-Free Offline Reinforcement Learning

We revisit offline reinforcement learning on episodic time-homogeneous t...
research
05/16/2022

Efficient Algorithms for Planning with Participation Constraints

We consider the problem of planning with participation constraints intro...
research
06/14/2021

Online Sub-Sampling for Reinforcement Learning with General Function Approximation

Designing provably efficient algorithms with general function approximat...

Please sign up or login with your details

Forgot password? Click here to reset