Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

07/06/2018
by   Ronan Fruit, et al.
0

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce , the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with S^C communicating states, A actions and Γ^C≤ S^C possible communicating next states, we derive a O(D^C√(Γ^C S^C AT)) regret bound, where D^C is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/12/2018

Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

We introduce SCAL, an algorithm designed to perform efficient exploratio...
research
12/11/2018

Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes

We introduce and analyse two algorithms for exploration-exploitation in ...
research
07/10/2020

Improved Analysis of UCRL2 with Empirical Bernstein Inequality

We consider the problem of exploration-exploitation in communicating Mar...
research
02/28/2019

Active Exploration in Markov Decision Processes

We introduce the active exploration problem in Markov decision processes...
research
10/08/2019

Receding Horizon Curiosity

Sample-efficient exploration is crucial not only for discovering rewardi...
research
07/20/2021

Similarity metrics for Different Market Scenarios in Abides

Markov Decision Processes (MDPs) are an effective way to formally descri...
research
12/08/2020

Minimax Regret Optimisation for Robust Planning in Uncertain Markov Decision Processes

The parameters for a Markov Decision Process (MDP) often cannot be speci...

Please sign up or login with your details

Forgot password? Click here to reset