Best Policy Identification in Discounted MDPs: Problem-specific Sample Complexity

09/28/2020
by   Aymen Al Marjani, et al.

We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) with finite state and action spaces. We assume that the agent has access to a generative model and that the MDP possesses a unique optimal policy. In this setting, we derive a problem-specific lower bound on the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program and is hence hard to exploit in the design of efficient algorithms. We provide a simple and tight upper bound on the sample complexity lower bound, whose corresponding nearly-optimal sample allocation is explicit. The upper bound depends on specific functionals of the MDP, such as the sub-optimality gaps and the variance of the next-state value function, and thus truly captures the hardness of the MDP. We devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). Finally, we discuss the advantages of KLB-TS over state-of-the-art algorithms.
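The "tracking" idea behind Track-and-Stop algorithms can be sketched in a few lines: given a target allocation w*(s, a) over state-action pairs (in KLB-TS, the explicit nearly-optimal allocation from the upper bound), the agent repeatedly queries the generative model at the pair whose empirical frequency lags furthest behind its target. The sketch below is purely illustrative: the uniform allocation, the state/action sizes, and all function names are assumptions, not the paper's actual construction, and the stopping rule is omitted.

```python
import numpy as np

# Toy generative-model setting: S states, A actions (sizes are illustrative).
S, A = 3, 2

def oracle_allocation(counts):
    """Placeholder for the target allocation w*(s, a).

    In KLB-TS this would be the explicit nearly-optimal allocation computed
    from estimated sub-optimality gaps and next-state value variances; here
    we use a uniform allocation purely for illustration.
    """
    return np.full((S, A), 1.0 / (S * A))

def track_and_stop_step(counts):
    """One tracking step (D-tracking style): sample the (s, a) pair whose
    empirical count lags furthest behind its share of the target allocation."""
    w = oracle_allocation(counts)
    t = counts.sum() + 1
    deficit = t * w - counts          # how far each pair is behind its target
    s, a = np.unravel_index(np.argmax(deficit), counts.shape)
    return s, a

counts = np.zeros((S, A))
for _ in range(1000):
    s, a = track_and_stop_step(counts)
    counts[s, a] += 1                 # query the generative model at (s, a)
    # (a full implementation would also update the empirical MDP and
    #  evaluate a KL-ball stopping rule here)

# With a uniform target, tracking drives the counts toward uniformity.
print(counts.min(), counts.max())
```

Because the tracking rule always samples the most under-sampled pair, the empirical frequencies converge to the target allocation; the stopping rule (omitted above) then certifies the optimal policy once enough evidence has accumulated.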


Related research

06/27/2012 | On the Sample Complexity of Reinforcement Learning with a Generative Model
We consider the problem of learning the optimal action-value function in...

08/11/2022 | Best Policy Identification in Linear MDPs
We investigate the problem of best policy identification in discounted l...

02/17/2020 | Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity
The current paper studies the problem of agnostic Q-learning with functi...

05/23/2018 | Representation Balancing MDPs for Off-Policy Policy Evaluation
We study the problem of off-policy policy evaluation (OPPE) in RL. In co...

12/01/2022 | Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP
This work considers the sample complexity of obtaining an ε-optimal poli...

06/05/2021 | Navigating to the Best Policy in Markov Decision Processes
We investigate the classical active pure exploration problem in Markov D...

03/22/2022 | A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle
Q-learning with function approximation could diverge in the off-policy s...
