Near-optimal Reinforcement Learning in Factored MDPs

03/15/2014
by Ian Osband, et al.

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer Ω(√(SAT)) regret on some MDP, where T is the elapsed time and S and A are the cardinalities of the state and action spaces. This implies T = Ω(SA) time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, S and A can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than S or A. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
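One of the two algorithms named above, posterior sampling reinforcement learning (PSRL), can be illustrated with a minimal tabular sketch. The code below is an assumption-laden toy, not the paper's factored implementation: it assumes known rewards, a Dirichlet posterior over transitions, and a finite horizon, and the function name `psrl_episode` is hypothetical. Each episode samples one MDP from the posterior, solves it by backward dynamic programming, and returns the greedy policy for that sample.

```python
import numpy as np

def psrl_episode(counts, rewards, horizon, rng):
    """One PSRL episode on a tabular MDP with known rewards.

    counts[s, a] holds Dirichlet posterior parameters over next states,
    updated elsewhere from observed transitions.
    """
    S, A = rewards.shape
    # 1. Sample a transition model from the posterior.
    P = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                  for s in range(S)])              # shape (S, A, S)
    # 2. Solve the sampled MDP by finite-horizon value iteration.
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = rewards + P @ V                        # (S, A) action values
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    # 3. The caller executes `policy` in the real MDP and updates `counts`.
    return P, policy
```

In the factored setting studied in the paper, the gain comes from replacing the flat Dirichlet posterior over all S next states with independent posteriors over each factor's small scope, so the number of learned parameters scales with the factored encoding rather than with S or A.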


Related research

06/23/2020 · Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
Modern tasks in reinforcement learning are always with large state and a...

10/15/2022 · Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
In this paper, we study the episodic reinforcement learning (RL) probl...

11/01/2021 · Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure
Multi-agent reinforcement learning (MARL) problems are challenging due t...

02/09/2021 · RL for Latent MDPs: Regret Guarantees and a Lower Bound
In this work, we consider the regret minimization problem for reinforcem...

11/21/2017 · Posterior Sampling for Large Scale Reinforcement Learning
Posterior sampling for reinforcement learning (PSRL) is a popular algori...

12/26/2021 · Reducing Planning Complexity of General Reinforcement Learning with Non-Markovian Abstractions
The field of General Reinforcement Learning (GRL) formulates the problem...

02/14/2012 · Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search
Bayes-optimal behavior, while well-defined, is often difficult to achiev...
