
Minimax Optimal Reinforcement Learning for Discounted MDPs
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) in the tabular setting. We propose a model-based algorithm named UCBVI-γ, which is based on the optimism-in-the-face-of-uncertainty principle and a Bernstein-type bonus. It achieves Õ(√(SAT)/(1-γ)^1.5) regret, where S is the number of states, A is the number of actions, γ is the discount factor and T is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least Ω̃(√(SAT)/(1-γ)^1.5). Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-γ is nearly minimax optimal for discounted MDPs.
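The Bernstein-type bonus mentioned in the abstract combines a variance-dependent term that shrinks like 1/√n with a lower-order 1/n correction, rather than relying on a worst-case Hoeffding-style bound. The sketch below illustrates the general shape of such a bonus; the function name, constants, and signature are illustrative assumptions, not the paper's actual specification.

```python
import numpy as np

def bernstein_bonus(value_samples, n_visits, delta=0.05, gamma=0.99):
    """Illustrative Bernstein-type exploration bonus for one (state, action) pair.

    Combines a variance-dependent sqrt(1/n) term with a range-dependent
    1/n correction, as in Bernstein's inequality. Constants are schematic.
    """
    log_term = np.log(1.0 / delta)
    var_hat = np.var(value_samples)      # empirical variance of next-state values
    vmax = 1.0 / (1.0 - gamma)           # value range in a discounted MDP
    return (np.sqrt(2.0 * var_hat * log_term / n_visits)
            + vmax * log_term / (3.0 * n_visits))
```

Because the leading term scales with the empirical variance rather than the full value range 1/(1-γ), the bonus tightens quickly for low-variance transitions, which is what enables the improved (1-γ)^1.5 dependence over Hoeffding-based bonuses.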