
Best Policy Identification in Linear MDPs

08/11/2022
by Jerome Taupin et al.

We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed-confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an ε-optimal policy with probability 1-δ. This lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but it can serve as a starting point for devising simple, near-optimal sampling rules and algorithms. We devise such algorithms; one of them exhibits a sample complexity upper bounded by O(d/(ε+Δ)^2 (log(1/δ)+d)), where Δ denotes the minimum reward gap of sub-optimal actions and d is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all δ) and matches existing minimax and gap-dependent lower bounds. Finally, we extend our algorithm to episodic linear MDPs.
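As an order-of-magnitude illustration of how the stated upper bound behaves, the minimal sketch below evaluates O(d/(ε+Δ)^2 (log(1/δ)+d)) numerically. The function name and the leading constant c are hypothetical placeholders, not from the paper: the bound is stated only up to universal constants.

```python
import math

def sample_complexity_bound(d: int, eps: float, gap: float, delta: float,
                            c: float = 1.0) -> float:
    """Evaluate c * d / (eps + gap)^2 * (log(1/delta) + d).

    Mirrors the abstract's upper bound O(d/(eps+Delta)^2 (log(1/delta)+d)).
    The constant c is a hypothetical placeholder, since the paper's bound
    holds only up to universal constants.
    """
    return c * d / (eps + gap) ** 2 * (math.log(1.0 / delta) + d)

# Example: feature dimension d=10, accuracy eps=0.1, minimum gap Delta=0.05,
# confidence parameter delta=1e-3.
print(f"{sample_complexity_bound(10, 0.1, 0.05, 1e-3):.0f} samples (up to constants)")
```

Note how the bound shrinks as either ε or Δ grows, and how the dependence on the confidence level enters only through log(1/δ), which is why the guarantee remains meaningful in the moderate-confidence regime.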

Related research

09/28/2020  Best Policy Identification in discounted MDPs: Problem-specific Sample Complexity
We investigate the problem of best-policy identification in discounted M...

06/13/2022  Near-Optimal Sample Complexity Bounds for Constrained MDPs
In contrast to the advances in characterizing the sample complexity for ...

06/05/2021  Navigating to the Best Policy in Markov Decision Processes
We investigate the classical active pure exploration problem in Markov D...

03/17/2022  Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs
In probably approximately correct (PAC) reinforcement learning (RL), an ...

11/02/2021  Dealing With Misspecification In Fixed-Confidence Linear Top-m Identification
We study the problem of the identification of m arms with largest means ...

12/24/2021  Scenario-Based Verification of Uncertain Parametric MDPs
We consider parametric Markov decision processes (pMDPs) that are augmen...