# Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov Decision Process (MDP) for which an upper bound c on the span of the optimal bias function is known. For an MDP with S states, A actions and Γ ≤ S possible next states, we prove a regret bound of O(c√(ΓSAT)), which significantly improves over existing algorithms (e.g., UCRL and PSRL), whose regret scales linearly with the MDP diameter D. In fact, the optimal bias span is finite and often much smaller than D (e.g., D = ∞ in non-communicating MDPs). A similar result was originally derived by Bartlett and Tewari (2009) for REGAL.C, for which no tractable algorithm is available. In this paper, we relax the optimization problem at the core of REGAL.C, we carefully analyze its properties, and we provide the first computationally efficient algorithm to solve it. Finally, we report numerical simulations supporting our theoretical findings and showing how SCAL significantly outperforms UCRL in MDPs with large diameter and small span.


## 1 Introduction

While learning in an unknown environment, a reinforcement learning (RL) agent must trade off the exploration needed to collect information about the dynamics and reward, and the exploitation of the experience gathered so far to gain as much reward as possible. In this paper, we focus on the regret framework (Jaksch et al., 2010), which evaluates the exploration-exploitation performance by comparing the rewards accumulated by the agent and an optimal policy. A common approach to the exploration-exploitation dilemma is the optimism in the face of uncertainty (OFU) principle: the agent maintains optimistic estimates of the value function and, at each step, it executes the policy with highest optimistic value (e.g., Brafman and Tennenholtz, 2003; Jaksch et al., 2010; Bartlett and Tewari, 2009). An alternative approach is posterior sampling (Thompson, 1933), which maintains a Bayesian distribution over MDPs (i.e., dynamics and expected reward) and, at each step, samples an MDP and executes the corresponding optimal policy (e.g., Osband et al., 2013; Abbasi-Yadkori and Szepesvári, 2015; Osband and Roy, 2017; Ouyang et al., 2017; Agrawal and Jia, 2017).

Given a finite MDP with S states, A actions and diameter D (i.e., the time needed to connect any two states), Jaksch et al. (2010) proved that no algorithm can achieve regret smaller than Ω(√(DSAT)). While recent work successfully closed the gap between upper and lower bounds w.r.t. the dependency on the number of states (e.g., Agrawal and Jia, 2017; Azar et al., 2017), relatively little attention has been devoted to the dependency on D. While the diameter quantifies the number of steps needed to “recover” from a bad state in the worst case, the actual regret incurred while “recovering” is related to the difference in potential reward between “bad” and “good” states, which is accurately measured by the span (i.e., the range) of the optimal bias function h*. While the diameter is an upper bound on the bias span, it could be arbitrarily larger (e.g., weakly-communicating MDPs may have finite span and infinite diameter), thus suggesting that algorithms whose regret scales with the span may perform significantly better.¹ Building on the idea that the OFU principle should be mitigated by the bias span of the optimistic solution, Bartlett and Tewari (2009) proposed three different algorithms (referred to as Regal) achieving regret scaling with the bias span sp{h*} instead of D. The first algorithm defines a span-regularized problem, where the regularization constant needs to be carefully tuned depending on the state-action pairs visited in the future, which makes it unfeasible in practice. Alternatively, they propose a constrained variant, called Regal.C, where the regularized problem is replaced by a constraint on the span. Assuming that an upper bound c on the bias span of the optimal policy is known (i.e., sp{h*} ≤ c), Regal.C achieves a regret bound in which the diameter D is replaced by the span bound c. Unfortunately, they do not propose any computationally tractable algorithm solving the constrained optimization problem, which may even be ill-posed in some cases. Finally, Regal.D avoids the need of knowing the future visits by using a doubling trick, but still requires solving a regularized problem, for which no computationally tractable algorithm is known.

¹The proof of the lower bound relies on the construction of an MDP whose diameter actually coincides with the bias span (up to a multiplicative numerical constant), thus leaving open the question whether the “actual” lower bound depends on D or on the bias span. See (Osband and Van Roy, 2016) for a more thorough discussion.

In this paper, we build on Regal.C and propose a constrained optimization problem for which we derive a computationally efficient algorithm, called ScOpt. We identify conditions under which ScOpt converges to the optimal solution and propose a suitable stopping criterion to achieve an ε-optimal policy. Finally, we show that using a slightly modified optimistic argument, the convergence conditions are always satisfied and the learning algorithm obtained by integrating ScOpt into a UCRL-like scheme (resulting in SCAL) achieves regret scaling as Õ(c√(ΓSAT)) when an upper bound c on the optimal bias span is available, thus providing the first computationally tractable algorithm that can solve weakly-communicating MDPs.

## 2 Preliminaries

We consider a finite weakly-communicating Markov decision process M = (S, A, r, p) (Puterman, 1994, Sec. 8.3) with a set of states S and a set of actions A = ∪_{s∈S} A_s. Each state-action pair (s, a) is characterized by a reward distribution with mean r(s, a) and support in [0, r_max], as well as a transition probability distribution p(·|s, a) over next states. We denote by S = |S| and A = max_s |A_s| the number of states and actions, and by Γ = max_{s,a} ‖p(·|s, a)‖_0 the maximum support of all transition probabilities. A Markov randomized decision rule d: S → P(A) maps states to distributions over actions. The corresponding set is denoted by D^MR, while the subset of Markov deterministic decision rules is D^MD. A stationary policy π = (d, d, …) repeatedly applies the same decision rule d over time. The set of stationary policies defined by Markov randomized (resp. deterministic) decision rules is denoted by Π^SR (resp. Π^SD). The long-term average reward (or gain) of a policy π ∈ Π^SR starting from s ∈ S is

 g^π_M(s) := lim_{T→+∞} E[ (1/T) ∑_{t=1}^T r(s_t, a_t) ],

where a_t ∼ d(s_t). Any stationary policy π has an associated bias function defined as

 h^π_M(s) := C-lim_{T→+∞} E[ ∑_{t=1}^T ( r(s_t, a_t) − g^π_M(s_t) ) ],

that measures the expected total difference between the reward and the stationary reward in Cesaro-limit² (denoted C-lim). Accordingly, the difference h^π_M(s) − h^π_M(s′) quantifies the (dis-)advantage of starting in state s rather than s′. In the following, we drop the dependency on M whenever clear from the context and denote by sp{h^π} := max_s h^π(s) − min_s h^π(s) the span of the bias function. In weakly communicating MDPs, any optimal policy π* ∈ argmax_π g^π has constant gain, i.e., g^{π*}(s) = g* for all s ∈ S. Let P_d and r_d be the transition matrix and reward vector associated with decision rule d ∈ D^MR. We denote by L_d the Bellman operator associated with d and by L the optimal Bellman operator:

 ∀v ∈ ℝ^S,  L_d v := r_d + P_d v;  L v := max_{d ∈ D^MR} { r_d + P_d v }.

²For policies with an aperiodic chain, the standard limit exists.

For any policy π = (d)^∞ ∈ Π^SR, the gain g^π and bias h^π satisfy the following system of evaluation equations:

 g^π = P_d g^π;  h^π = L_d h^π − g^π. (1)

Moreover, there exists a policy π* for which the optimal gain g* and bias h* satisfy the optimality equation

 h* = L h* − g* e,  where e = (1, …, 1)^⊤. (2)

Finally, we denote by D := max_{s≠s′} τ_M(s → s′) the diameter of M, where τ_M(s → s′) is the minimal expected number of steps needed to reach s′ from s in M.

Learning problem. Let M* be the true unknown MDP. We consider the learning problem where S, A and r_max are known, while rewards r and transition probabilities p are unknown and need to be estimated on-line. We evaluate the performance of a learning algorithm after T time steps by its cumulative regret Δ(T) := T g* − ∑_{t=1}^T r_t(s_t, a_t).

## 3 Optimistic Exploration-Exploitation

Since our proposed algorithm SCAL (Sec. 6) is a tractable variant of Regal.C and thus a modification of UCRL, we first recall their common structure summarized in Fig. 1.

### 3.1 Upper-Confidence Reinforcement Learning

UCRL proceeds through episodes. At the beginning of each episode k, UCRL computes a set of plausible MDPs defined as M_k = { M = (S, A, r̃, p̃) : r̃(s, a) ∈ B_r(s, a), p̃(·|s, a) ∈ B_p(s, a) }, where B_r and B_p are high-probability confidence intervals on the rewards and transition probabilities of the true MDP M*, which guarantees that M* ∈ M_k w.h.p. We use confidence intervals constructed using the empirical Bernstein inequality (Audibert et al., 2007; Maurer and Pontil, 2009):

 β^{sa}_{r,k} := √( 14 σ̂²_{r,k}(s,a) b_{k,δ} / max{1, N_k(s,a)} ) + (49/3) r_max b_{k,δ} / max{1, N_k(s,a) − 1},

 β^{sas′}_{p,k} := √( 14 σ̂²_{p,k}(s′|s,a) b_{k,δ} / max{1, N_k(s,a)} ) + (49/3) b_{k,δ} / max{1, N_k(s,a) − 1},

where N_k(s, a) is the number of visits in (s, a) before episode k, σ̂²_{r,k}(s, a) and σ̂²_{p,k}(s′|s, a) are the empirical variances of r(s, a) and p(s′|s, a), and b_{k,δ} is a logarithmic confidence term in k, S, A and 1/δ. Given the empirical averages r̂_k(s, a) and p̂_k(s′|s, a) of rewards and transitions, we define B_r(s, a) := [r̂_k(s, a) − β^{sa}_{r,k}, r̂_k(s, a) + β^{sa}_{r,k}] ∩ [0, r_max] and B_p(s, a, s′) := [p̂_k(s′|s, a) − β^{sas′}_{p,k}, p̂_k(s′|s, a) + β^{sas′}_{p,k}] ∩ [0, 1].
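As an illustration, the confidence width β above can be computed as follows (a minimal sketch; the function name is ours, and `b_delta` stands for the logarithmic term b_{k,δ}):

```python
import math

def bernstein_width(var_hat, n, b_delta, range_max=1.0):
    """Empirical-Bernstein confidence width with the constants of the display above:
    sqrt(14 * var * b / max(1, n)) + (49/3) * range * b / max(1, n - 1)."""
    return (math.sqrt(14.0 * var_hat * b_delta / max(1, n))
            + (49.0 / 3.0) * range_max * b_delta / max(1, n - 1))

# The width shrinks with the number of visits n (roughly as 1/sqrt(n))
w_few = bernstein_width(var_hat=0.25, n=10, b_delta=2.0)
w_many = bernstein_width(var_hat=0.25, n=1000, b_delta=2.0)
assert w_many < w_few
```

The variance-dependent first term dominates for large n, which is what makes Bernstein-type intervals tighter than Hoeffding-type ones when the empirical variance is small.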

Once M_k has been computed, UCRL finds an approximate solution to the optimization problem

 (M̃*_k, π̃*_k) ∈ argmax_{M ∈ M_k, π ∈ Π^SD(M)} g^π_M. (3)

Since M* ∈ M_k w.h.p., it holds that g^{π̃*_k}_{M̃*_k} ≥ g* w.h.p. As noticed by Jaksch et al. (2010), problem (3) is equivalent to finding the optimal policy of the extended MDP M̃_k (sometimes called bounded-parameter MDP) implicitly defined by M_k. More precisely, in M̃_k the (finite) action space is “extended” to a compact action space by considering every possible value of the confidence intervals B_r and B_p as fictitious actions. The equivalence between the two problems comes from the fact that for each policy π̃ of M̃_k there exists a pair (M, π) with M ∈ M_k such that the policies π̃ and π induce the same Markov reward process on M̃_k and M respectively, and conversely. Consequently, (3) can be solved by running so-called extended value iteration (EVI): starting from an initial vector u_0 = 0, EVI recursively computes

 u_{n+1}(s) = max_{a, r̃, p̃} [ r̃(s, a) + p̃(·|s, a)^⊤ u_n ] = L̃ u_n(s), (4)

where L̃ is the optimistic optimal Bellman operator associated to M̃_k. If EVI is stopped when sp{u_{n+1} − u_n} ≤ ε, then the greedy policy π̃_k w.r.t. u_n is guaranteed to be ε-optimal in M̃_k. Therefore, the policy π̃_k associated to u_n is an optimistic ε-optimal policy, and UCRL executes π̃_k until the end of episode k.
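A minimal sketch of the EVI recursion (4) is given below (our own illustration; for simplicity, each state-action pair carries a single L1 confidence radius on its transition vector, in the style of Jaksch et al. (2010), instead of the per-next-state Bernstein intervals defined above):

```python
import numpy as np

def optimistic_p(p_hat, beta, u):
    """Inner maximization of p^T u over the L1 ball ||p - p_hat||_1 <= beta:
    move probability mass toward the state with the largest value u."""
    p = p_hat.copy()
    p[np.argmax(u)] = min(1.0, p_hat[np.argmax(u)] + beta / 2.0)
    for s in np.argsort(u):            # remove excess mass from the worst states
        if p.sum() <= 1.0 + 1e-12:
            break
        p[s] = max(0.0, p[s] - (p.sum() - 1.0))
    return p

def evi(r_hat, beta_r, p_hat, beta_p, r_max, eps):
    """Extended value iteration: u_{n+1}(s) = max_{a, r, p} [r + p^T u_n]."""
    S, A = r_hat.shape
    u = np.zeros(S)
    while True:
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                r_opt = min(r_hat[s, a] + beta_r[s, a], r_max)   # optimistic reward
                p_opt = optimistic_p(p_hat[s, a], beta_p[s, a], u)
                q[s, a] = r_opt + p_opt @ u
        u_next = q.max(axis=1)
        d = u_next - u
        if d.max() - d.min() < eps:    # span stopping rule
            return u_next, q.argmax(axis=1)
        u = u_next - u_next.min()      # keep the iterates bounded
```

With zero confidence radii the recursion reduces to plain value iteration on the empirical MDP, and the span stopping rule still yields a near-optimal greedy policy.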

### 3.2 A first relaxation of Regal.C

Regal.C follows the same steps as UCRL but, instead of solving problem (3), it tries to find the best optimistic model having constrained optimal bias span, i.e.,

 (M̃*_RC, π̃*_RC) = argmax_{M ∈ M^RC_k, π ∈ Π^SD(M)} g^π_M, (5)

where M^RC_k := { M ∈ M_k : sp{h*_M} ≤ c } is the set of plausible MDPs with bias span of the optimal policy bounded by c. Under the assumption that sp{h*_{M*}} ≤ c, Regal.C discards any MDP whose optimal policy has a span larger than c (i.e., M ∉ M^RC_k) and otherwise looks for the MDP with highest optimal gain g*_M. Unfortunately, there is no guarantee that all MDPs in M_k are weakly communicating and thus have constant gain. As a result, we suspect this problem to be ill-posed (i.e., the maximum is most likely not well-defined). Moreover, even if it is well-posed, searching the space M^RC_k seems to be computationally intractable. Finally, for any M ∈ M_k, there may be several optimal policies with different bias spans, and some of them may not satisfy the optimality equation (2) and are thus difficult to compute.

In this paper, we slightly modify problem (5) as follows:

 (M̃*_c, π̃*_c) ∈ argmax_{M ∈ M_k, π ∈ Π_c(M)} g^π_M, (6)

where the search space of policies is defined as

 Π_c(M) := { π ∈ Π^SR : sp{h^π_M} ≤ c ∧ sp{g^π_M} = 0 },

and g̃*_c := g^{π̃*_c}_{M̃*_c} if the argmax exists. Similarly to (3), problem (6) is equivalent to solving the same problem on the extended MDP M̃_k. Unlike (5), for every MDP in M_k (not just those in M^RC_k), (6) considers all (stationary) policies with constant gain satisfying the span constraint (not just the deterministic optimal policies).

Since g^π_M and h^π_M are in general non-continuous functions of (π, M), the argmax in (5) and (6) may not exist. Nevertheless, by reasoning in terms of supremum value, we can show that (6) is always a relaxation of (5) (where we enforce the additional constraint of constant gain).

###### Proposition 1.

Define the restricted set of MDPs E_k := { M ∈ M_k : sp{h*_M} ≤ c ∧ sp{g*_M} = 0 }. Then

 sup_{M ∈ E_k, π ∈ Π^SD} g^π_M ≤ sup_{M ∈ M_k, π ∈ Π_c(M)} g^π_M.
###### Proof.

The result follows from the fact that E_k ⊆ M_k and that, for any M ∈ E_k, the optimal (deterministic) policies of M belong to Π_c(M). ∎

As a result, the optimism principle is preserved when moving from (5) to (6), and since the set of admissible MDPs is the same, any algorithm solving (6) enjoys the same regret guarantees as Regal.C. In the following, we further characterise problem (6), introduce a truncated value iteration algorithm to solve it, and finally integrate it into a UCRL-like scheme to recover the Regal.C regret guarantees.

## 4 The Optimization Problem

In this section we analyze some properties of the following optimization problem, of which (6) is an instance,

 sup_{π ∈ Π_c(M)} { g^π_M }, (7)

where M is any MDP (with discrete or compact action space) s.t. Π_c(M) ≠ ∅. Problem (7) aims at finding a policy that maximizes the gain within the set of randomized policies with constant gain (i.e., sp{g^π} = 0) and bias span smaller than c (i.e., sp{h^π} ≤ c). Since Π_c(M) ≠ ∅, the supremum always exists and we denote it by g*_c. The set of maximizers is denoted by Π*_c(M), with elements π*_c (if Π*_c(M) is non-empty).

In order to give some intuition about the solutions of problem (7), we introduce the following illustrative MDP.

###### Example 1.

Consider the two-state MDP depicted in Fig. 2. For a generic stationary policy with decision rule d parameterized by (x, y) ∈ [0, 1]² we have that

 d = [ x  1−x ; y  1−y ];  P_d = [ 1−x  x ; y  1−y ],  r_d = [ (1−x)/2 ; 1−y ].

We can compute the gain and the bias by solving the linear system (1). For any x > 0 or y > 0, we obtain

 g_1 = g_2 = 1/2 + x(1−3y)/(2(x+y));  h_2 − h_1 = 1/2 + (1−3y)/(2(x+y)),

while for x = 0, y = 0, we have g_1 = 1/2 and g_2 = 1, with h_1 = h_2 = 0. Note that sp{h} = |h_2 − h_1| for any policy. In the following, we will use this example, choosing particular values for x, y and c, to illustrate some important properties of optimization problem (7).
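These closed-form expressions can be verified numerically by solving the evaluation equations (1) (a small sanity-check script; the helper name is ours):

```python
import numpy as np

def gain_bias(P, r):
    """Gain and bias of a Markov reward process with a single recurrent class,
    via the stationary distribution and (I - P) h = r - g e with h[0] = 0."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    mu = np.linalg.lstsq(A, b, rcond=None)[0]    # stationary distribution
    g = float(mu @ r)
    M = np.eye(n) - P
    M[0] = 0.0; M[0, 0] = 1.0                    # pin h[0] = 0
    rhs = r - g; rhs[0] = 0.0
    return g, np.linalg.solve(M, rhs)

x, y = 0.3, 0.2                                  # any x > 0 or y > 0
P = np.array([[1 - x, x], [y, 1 - y]])
r = np.array([(1 - x) / 2, 1 - y])
g, h = gain_bias(P, r)
assert abs(g - (0.5 + x * (1 - 3 * y) / (2 * (x + y)))) < 1e-9
assert abs((h[1] - h[0]) - (0.5 + (1 - 3 * y) / (2 * (x + y)))) < 1e-9
```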

Randomized policies. The following lemma shows that, unlike in unconstrained gain maximization where there always exists an optimal deterministic policy, the solution of (7) may indeed be a randomized policy.

###### Lemma 2.

There exists an MDP M and a scalar c > 0 such that Π*_c(M) ≠ ∅ and Π*_c(M) ∩ Π^SD = ∅, i.e., all gain-maximizing policies satisfying the span constraint are randomized.

###### Proof.

Consider Ex. 1 with constraint c = 3/4. The deterministic policies with constant gain and bias span smaller than c are x = 1, y = 1 (with g = 0 and sp{h} = 0) and x = 0, y = 1 (with g = 1/2 and sp{h} = 1/2). On the other hand, a randomized policy can satisfy the constraint and achieve a higher gain by taking x = 1 and y = 1/7, which gives g = 3/4 and h_2 − h_1 = 3/4 = c; a direct computation shows that no feasible policy achieves a larger gain, thus proving the statement. ∎

Constant gain. The following lemma shows that if we consider non-constant gain policies, the supremum in (7) may not be well defined, as no dominating policy exists. A policy π† is dominating if for any policy π, g^{π†}(s) ≥ g^π(s) in all states s ∈ S.

###### Lemma 3.

There exists an MDP M and a scalar c > 0 such that there exists no dominating policy among the policies with constrained bias span (i.e., sp{h^π} ≤ c).

###### Proof.

Consider Ex. 1 with constraint c = 3/4. As shown in the proof of Lem. 2, the optimal stationary policy with constant gain has g_1 = g_2 = 3/4. On the other hand, the only policy with non-constant gain is x = 0, y = 0, which has g_1 = 1/2 and g_2 = 1, with h_1 = h_2 = 0 (so the span constraint is satisfied). Since no feasible policy attains both g_1 ≥ 3/4 and g_2 ≥ 1 at the same time, no dominating policy exists, thus proving the statement. ∎

On the other hand, when the search space is restricted to policies with constant gain, the optimization problem is well posed. Whether problem (7) always admits a maximizer is left as an open question. The main difficulty comes from the fact that, in general, π ↦ g^π is not a continuous map and Π_c(M) is not a closed set. For instance in Ex. 1, although the maximum is attained, the limit point x = 0, y = 0 does not belong to Π_c(M) (i.e., Π_c(M) is not closed) and π ↦ g^π is not continuous at this point. Notice that when the MDP M is unichain (Puterman, 1994, Sec. 8.3), Π_c(M) is compact, π ↦ g^π is continuous, and we can prove the following lemma (see App. A):

###### Lemma 4.

If M is unichain, then Π*_c(M) ≠ ∅.

We will later show that for the specific instances of (7) that are encountered by our algorithm SCAL, Lem. 4 holds.

## 5 Planning with ScOpt

In this section, we introduce ScOpt and derive sufficient conditions for its convergence to the solution of (7). In the next section, we will show that these assumptions always hold when ScOpt is carefully integrated into UCRL (while in App. B we show that they may not hold in general).

### 5.1 Span-constrained value and policy operators

ScOpt is a version of (relative) value iteration (Puterman, 1994; Bertsekas, 1995), where the optimal Bellman operator L is modified to return value functions with span bounded by c, and the stopping condition is tailored to return a constrained greedy policy with near-optimal gain. We first introduce a constrained version of the optimal Bellman operator L.

###### Definition 1.

Given v ∈ ℝ^S and c ≥ 0, we define the value operator T_c as

 T_c v(s) := { L v(s) if s ∈ S̄(c, v);  c + min_{s′}{ L v(s′) } if s ∈ S ∖ S̄(c, v), (8)

where S̄(c, v) := { s ∈ S : L v(s) ≤ c + min_{s′} L v(s′) }.

In other words, operator T_c applies a span truncation to the one-step application of L, that is, for any state s, T_c v(s) = min{ L v(s), c + min_{s′} L v(s′) }, which guarantees that sp{T_c v} ≤ c. Unlike L, operator T_c is not always associated with a decision rule d s.t. T_c v = L_d v (see App. B). We say that T_c is feasible at v ∈ ℝ^S and s ∈ S if there exists a distribution δ⁺_v(s, ·) over A_s such that

 T_c v(s) = ∑_{a ∈ A_s} δ⁺_v(s, a) [ r(s, a) + p(·|s, a)^⊤ v ]. (9)

When such a distribution exists in all states, we say that T_c is globally feasible at v, and δ⁺_v is its associated decision rule, i.e., T_c v = L_{δ⁺_v} v. In the following lemma, we identify sufficient and necessary conditions for (global) feasibility.

###### Lemma 5.

Operator T_c is feasible at v and s if and only if

 min_{a ∈ A_s} { r(s, a) + p(·|s, a)^⊤ v } ≤ min_{s′} { L v(s′) } + c. (10)

Furthermore, let

 D(c, v) := { d ∈ D^MR : sp{L_d v} ≤ c } (11)

be the set of randomized decision rules whose associated operator returns a span-constrained value function when applied to v. Then, T_c is globally feasible at v if and only if condition (10) holds in every state s ∈ S, in which case we have

 T_c v = max_{δ ∈ D(c, v)} L_δ v,  and  δ⁺_v ∈ argmax_{δ ∈ D(c, v)} L_δ v. (12)

The last part of this lemma shows that when T_c is globally feasible at v (i.e., δ⁺_v exists), T_c v is the componentwise maximal value function of the form L_δ v with decision rule δ satisfying sp{L_δ v} ≤ c. Surprisingly, even in the presence of a constraint on the one-step value span, such a componentwise maximum still exists (which is not as straightforward as in the case of the greedy operator L). Therefore, whenever T_c is globally feasible, optimization problem (12) can be seen as an LP problem (see App. A.2).

###### Definition 2.

Given v ∈ ℝ^S and c ≥ 0, let S̃(c, v) ⊆ S be the set of states where T_c is feasible (condition (10)) and δ⁺_v be the associated decision rule (Eq. 9). We define the operator G_c as³

 G_c v(s) := { δ⁺_v(s, ·) if s ∈ S̃(c, v);  argmin_{a ∈ A_s} { r(s, a) + p(·|s, a)^⊤ v } if s ∈ S ∖ S̃(c, v).

As a result, if T_c is globally feasible at v, then by definition T_c v = L_{G_c v} v. Note that computing G_c v is not significantly more difficult than computing a greedy policy (see App. C for an efficient implementation).

³When there are several distributions achieving T_c v(s) in state s, G_c v chooses an arbitrary one.

We are now ready to introduce ScOpt (Fig. 3). Given an initial vector v_0 ∈ ℝ^S and a reference state s̄ ∈ S, ScOpt implements relative value iteration where L is replaced by T_c, i.e.,

 v_{n+1} = T_c v_n − T_c v_n(s̄) e. (13)

Notice that the term T_c v_n(s̄) subtracted at any iteration n prevents v_n from increasing linearly with n and thus avoids numerical instability. However, the subtraction can be dropped without affecting the convergence properties of ScOpt. If the stopping condition sp{v_{n+1} − v_n} ≤ ε is met at iteration n, ScOpt returns the policy π_n := G_c v_n.
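For a concrete MDP given as arrays (rather than the extended MDP used later), iteration (13) can be sketched as follows. This is our own simplified illustration: it returns the plain greedy policy instead of the operator G_c, whose randomized form is discussed above (and in App. C):

```python
import numpy as np

def scopt(r, P, c, s_ref=0, eps=1e-6, max_iter=10_000):
    """Relative value iteration with the span-truncated operator
    T_c v = min(L v, c + min_s L v(s)), so that sp(T_c v) <= c always holds."""
    v = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Lv = (r + P @ v).max(axis=1)          # optimal Bellman operator L
        Tcv = np.minimum(Lv, Lv.min() + c)    # span truncation (Def. 1)
        d = Tcv - v
        if d.max() - d.min() < eps:           # stopping condition sp(v_{n+1} - v_n) <= eps
            gain = 0.5 * (d.max() + d.min())
            return gain, v, (r + P @ v).argmax(axis=1)
        v = Tcv - Tcv[s_ref]                  # relative update, Eq. (13)
    raise RuntimeError("no convergence")

# Two-state chain: with a loose constraint the gain of plain relative value
# iteration is recovered; a tight span constraint lowers the fixed-point gain.
r = np.array([[0.35], [0.8]])
P = np.array([[[0.7, 0.3]], [[0.2, 0.8]]])
g_free = scopt(r, P, c=10.0)[0]
g_tight = scopt(r, P, c=0.2)[0]
assert g_tight < g_free
```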

### 5.2 Convergence and Optimality Guarantees

In order to derive convergence and optimality guarantees for ScOpt we need to analyze the properties of operator . We start by proving that preserves the one-step span contraction properties of .

###### Assumption 6.

The optimal Bellman operator L is a 1-step γ-span-contraction, i.e., there exists a γ < 1 such that for any pair of vectors u, v ∈ ℝ^S, sp{L u − L v} ≤ γ sp{u − v}.⁴

⁴In the undiscounted setting, if the MDP is unichain, L is an m-stage span contraction for some finite m.

###### Lemma 7.

Under Asm. 6, T_c is a γ-span contraction.

The proof of Lemma 7 relies on the fact that the truncation of L in the definition of T_c is non-expansive in span semi-norm. Details are given in App. D, where it is also shown that T_c preserves other properties of L such as monotonicity and linearity. It then follows that T_c admits a fixed point (g⁺, h⁺) solution to an optimality equation (similar to Eq. 2), and thus ScOpt converges to the corresponding bias and gain, the latter being an upper bound on the optimal solution of (7). We formally state these results in Lem. 8.

###### Lemma 8.

Under Asm. 6, the following properties hold:

1. Optimality equation and uniqueness: There exists a solution (g⁺, h⁺) to the optimality equation

 T_c h⁺ = h⁺ + g⁺ e. (14)

If (g, h) is another solution of (14), then g = g⁺ and there exists λ ∈ ℝ s.t. h = h⁺ + λ e.

2. Convergence: For any initial vector v_0 ∈ ℝ^S, the sequence (v_n) generated by ScOpt converges to a solution vector h⁺ of the optimality equation (14), and

 lim_{n→+∞} T_c^{n+1} v_0 − T_c^n v_0 = g⁺ e.

3. Dominance: The gain g⁺ is an upper bound on the supremum of (7), i.e., g⁺ ≥ g*_c.

A direct consequence of point 2 of Lem. 8 (convergence) is that ScOpt always stops after a finite number of iterations. Nonetheless, T_c may not always be globally feasible at h⁺ (see App. B), and thus there may be no policy associated to optimality equation (14). Furthermore, even when there is one, Lem. 8 provides no guarantee on the performance of the policy returned by ScOpt after a finite number of iterations. To overcome these limitations, we introduce an additional assumption, which leads to stronger performance guarantees for ScOpt.

###### Assumption 9.

Operator T_c is globally feasible at any vector v ∈ ℝ^S such that sp{v} ≤ c.

###### Theorem 10.

Assume Asm. 6 and 9 hold and let γ denote the contractive factor of T_c (Asm. 6). For any initial vector v_0 ∈ ℝ^S such that sp{v_0} ≤ c and any accuracy ε > 0, the policy π_n output by ScOpt at the stopping condition is such that g^{π_n}(s) ≥ g⁺ − ε for all s ∈ S. Furthermore, if in addition the policy π⁺ := G_c h⁺ associated to the fixed point h⁺ is unichain, π⁺ is a solution to optimization problem (7), i.e., π⁺ ∈ Π_c(M) and g⁺ = g*_c.

The first part of the theorem shows that the stopping condition used in Fig. 3 ensures that ScOpt returns an ε-optimal policy π_n. Notice that while sp{v_n} ≤ c by definition of T_c, in general, when the policy π⁺ associated to h⁺ is not unichain, we might have g^{π⁺} ≠ g⁺. On the other hand, Corollary 8.2.7. of Puterman (1994) ensures that if π⁺ is unichain then g^{π⁺} = g⁺, hence the second part of the theorem. Notice also that even if π_n is unichain, we cannot guarantee that π_n satisfies the span constraint, i.e., sp{h^{π_n}} may be arbitrarily larger than c. Nonetheless, in the next section, we show that the definition of T_c and Thm. 10 are sufficient to derive regret bounds when ScOpt is integrated into UCRL.

## 6 Learning with SCAL

In this section we introduce SCAL, an optimistic online RL algorithm that employs ScOpt to compute policies that efficiently balance exploration and exploitation. We prove that the assumptions stated in Sec. 5.2 hold when ScOpt is integrated into the optimistic framework. Finally, we show that SCAL enjoys the same regret guarantees as Regal.C, while being the first implementable and efficient algorithm to solve bias-span constrained exploration-exploitation.

Based on Def. 1, we define T̃_c as the span truncation of the optimal Bellman operator L̃ of the bounded-parameter MDP M̃_k (see Sec. 3). Given the structure of problem (6), one might consider applying ScOpt (using T̃_c) to the extended MDP M̃_k. Unfortunately, in general L̃ does not satisfy Asm. 6 and 9, and thus T̃_c may not enjoy the properties of Lem. 8 and Thm. 10. To overcome this problem, we slightly modify M̃_k as described in Def. 3.

###### Definition 3.

Let M̃_k be a bounded-parameter (extended) MDP. Let η > 0 and s̄ ∈ S an arbitrary state. We define the “modified” extended MDP M̃‡_k associated to M̃_k by⁵

 B‡_r(s, a) = [0, max{B_r(s, a)}],

 B‡_p(s, a, s′) = { B_p(s, a, s′) if s′ ≠ s̄;  B_p(s, a, s̄) ∩ [η, 1] otherwise,

where we assume that η is small enough so that the modified confidence sets B‡_p(s, a, s̄) are non-empty and still contain valid probability values. We denote by L̃‡ the optimal Bellman operator of M̃‡_k (cf. Eq. 4) and by T̃‡_c the span truncation of L̃‡ (cf. Def. 1).

⁵For any closed interval B = [a, b] ⊆ ℝ, max{B} := b.

By slightly perturbing the confidence intervals of the transition probabilities, we enforce that the “attractive” state s̄ is reached with probability at least η from any state-action pair, implying that the ergodic coefficient of M̃‡_k,

 γ := 1 − min_{s,u ∈ S, a ∈ A_s, b ∈ A_u, p̃, q̃ ∈ B‡_p} { ∑_{j ∈ S} min{ p̃(j|s, a), q̃(j|u, b) } } ≤ 1 − η,

is smaller than 1 (since the term for j = s̄ is at least η), so that L̃‡ is γ-contractive (Puterman, 1994, Thm. 6.6.6), i.e., Asm. 6 holds. Moreover, for any policy, state s̄ necessarily belongs to all recurrent classes of the induced Markov chain, implying that every policy of M̃‡_k is unichain and so M̃‡_k is unichain. As is shown in Thm. 11, the η-perturbation of B_p introduces a small bias in the final gain.

By augmenting (without perturbing) the confidence intervals of the rewards, we ensure two nice properties. First of all, for any vector v ∈ ℝ^S, L̃‡ v ≥ L̃ v, and thus by definition T̃‡_c v ≥ T̃_c v, so that optimism is preserved. Secondly, since 0 ∈ B‡_r(s, a), there exists a decision rule δ₀ with zero reward such that L̃‡_{δ₀} v = P̃_{δ₀} v, meaning that sp{L̃‡_{δ₀} v} ≤ γ sp{v} (Puterman, 1994, Proposition 6.6.1). Thus if sp{v} ≤ c then sp{L̃‡_{δ₀} v} ≤ c, and so D(c, v) ≠ ∅, which by Lem. 5 implies that T̃‡_c is globally feasible at v. Therefore, Asm. 9 holds in M̃‡_k.

When combining both the perturbation of B_p and the augmentation of B_r, we obtain Thm. 11 (proof in App. E).

###### Theorem 11.

Let M̃_k be a bounded-parameter (extended) MDP and M̃‡_k its “modified” counterpart (see Def. 3). Then