Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

06/12/2019
by Zihan Zhang, et al.

We present an algorithm, based on the Optimism in the Face of Uncertainty (OFU) principle, that learns efficiently in reinforcement learning (RL) problems modeled as Markov decision processes (MDPs) with finite state-action spaces. By evaluating the state-pair difference of the optimal bias function h^*, the proposed algorithm achieves a regret bound of Õ(√(SAHT)) for an MDP with S states and A actions, provided that an upper bound H on the span of h^*, i.e., sp(h^*), is known. This improves on the best previous regret bound, Õ(HS√(AT)) [Bartlett and Tewari, 2009], by a factor of √(SH). Furthermore, this regret bound matches the lower bound of Ω(√(SAHT)) [Jaksch et al., 2010] up to logarithmic factors. As a consequence, we obtain a near-optimal regret bound of Õ(√(SADT)) for MDPs with finite diameter D, which matches the lower bound of Ω(√(SADT)) [Jaksch et al., 2010] up to logarithmic factors.
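To make the claimed √(SH) improvement concrete, the following is a minimal sketch that compares only the leading terms of the two regret bounds quoted in the abstract. It assumes that all constants and logarithmic factors hidden by the Õ notation are dropped, so the numbers are not actual regret values. The function names and the sample values of S, A, H, and T are hypothetical and chosen purely for illustration.

```python
import math


def prior_bound(S, A, H, T):
    """Leading term of the earlier bound, H * S * sqrt(A * T).

    Assumption: constants and log factors hidden by the O-tilde are dropped.
    """
    return H * S * math.sqrt(A * T)


def new_bound(S, A, H, T):
    """Leading term of the bound stated in the abstract, sqrt(S * A * H * T).

    Assumption: constants and log factors hidden by the O-tilde are dropped.
    """
    return math.sqrt(S * A * H * T)


def improvement_factor(S, A, H, T):
    """Ratio of the two leading terms; algebraically this equals sqrt(S * H)."""
    return prior_bound(S, A, H, T) / new_bound(S, A, H, T)


if __name__ == "__main__":
    # Hypothetical problem sizes, used only to illustrate the ratio.
    S, A, H, T = 20, 5, 10, 10**6
    print("ratio of leading terms:", improvement_factor(S, A, H, T))
    print("sqrt(S * H):           ", math.sqrt(S * H))
```

The ratio does not depend on A or T, since both bounds scale as √(AT). The gap between them comes entirely from the dependence on S and H.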


