
Best Policy Identification in Linear MDPs

by Jerome Taupin et al.

We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed-confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an ε-optimal policy with probability 1-δ. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but it can serve as the starting point for devising simple and near-optimal sampling rules and algorithms. We devise such algorithms; one of them exhibits a sample complexity upper bounded by O(d/(ε+Δ)^2 (log(1/δ)+d)), where Δ denotes the minimum reward gap of sub-optimal actions and d is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all δ) and matches existing minimax and gap-dependent lower bounds. Finally, we extend our algorithm to episodic linear MDPs.
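To get a feel for how the stated upper bound O(d/(ε+Δ)^2 (log(1/δ)+d)) scales, the sketch below evaluates its leading term for a few instance parameters. The multiplicative constant `c` is hidden by the big-O notation and is an assumption here (set to 1 purely for illustration); the function is not part of the paper's algorithm.

```python
import math

def sample_complexity_bound(d, eps, delta, gap, c=1.0):
    """Illustrative evaluation of the leading term of the abstract's
    upper bound: c * d / (eps + gap)^2 * (log(1/delta) + d).
    `c` stands in for the unspecified big-O constant (assumption)."""
    return c * d / (eps + gap) ** 2 * (math.log(1.0 / delta) + d)

# A favorable instance: moderate accuracy and a large minimum gap Δ.
n_easy = sample_complexity_bound(d=10, eps=0.1, delta=0.05, gap=0.5)

# A harder instance: tighter accuracy and vanishing gap (Δ = 0),
# where the bound degrades as d/ε^2 (log(1/δ) + d).
n_hard = sample_complexity_bound(d=10, eps=0.01, delta=0.05, gap=0.0)
```

The comparison shows the gap-dependent nature of the bound: when Δ is large relative to ε, the required number of samples is governed by Δ rather than by the target accuracy ε.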

