Best Policy Identification in Linear MDPs

08/11/2022
by Jerome Taupin, et al.

We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed-confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an ε-optimal policy with probability 1-δ. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but it can serve as the starting point for devising simple, near-optimal sampling rules and algorithms. We devise such algorithms. One of them exhibits a sample complexity upper bounded by O(d/(ε+Δ)^2 (log(1/δ)+d)), where Δ denotes the minimum reward gap of sub-optimal actions and d is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all δ) and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
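For readability, the sample complexity bound stated above can be written in display form (the expectation notation E[τ] for the stopping time is an assumption on our part; the abstract only states the order of the bound):

```latex
% Sample complexity of the proposed algorithm, as stated in the abstract:
% d   = dimension of the feature space
% eps = target accuracy of the identified policy
% Delta = minimum reward gap of sub-optimal actions
% delta = confidence parameter (holds for all delta, i.e. moderate confidence)
\mathbb{E}[\tau] \;=\; O\!\left( \frac{d}{(\varepsilon + \Delta)^{2}}
  \left( \log\frac{1}{\delta} + d \right) \right)
```

Note that as Δ grows (easier instances) or ε grows (coarser accuracy), the bound shrinks, consistent with matching the gap-dependent lower bounds mentioned in the abstract.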


