Navigating to the Best Policy in Markov Decision Processes

06/05/2021
by Aymen Al Marjani, et al.

We investigate the classical active pure exploration problem in Markov Decision Processes (MDPs), where the agent sequentially selects actions and, from the resulting system trajectory, aims to identify the best policy as fast as possible. We propose an information-theoretic lower bound on the average number of steps required before a correct answer can be given with probability at least 1-δ. This lower bound involves a non-convex optimization problem, for which we propose a convex relaxation. We further provide an algorithm whose sample complexity matches the relaxed lower bound up to a factor of 2. This algorithm addresses general communicating MDPs; we propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results for the generative setting <cit.>, where the agent could at each step observe the random outcome of any (state, action) pair. In contrast, we show here how to deal with the navigation constraints. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains, which we believe is of wider interest in the analysis of Markov Decision Processes.
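To make the contrast between the generative setting and the navigation constraints concrete, the following is a minimal sketch on a toy tabular MDP. It is not the paper's algorithm: the function names (`sample_next_state`, `generative_samples`, `navigation_samples`), the transition table `P`, and the fixed action rule are all hypothetical and chosen only for illustration.

```python
import random

def sample_next_state(P, s, a, rng):
    """Draw s' ~ P[s][a], where P[s][a] lists next-state probabilities."""
    return rng.choices(range(len(P[s][a])), weights=P[s][a], k=1)[0]

def generative_samples(P, pairs, rng):
    """Generative setting: any (state, action) pair may be queried, in any order."""
    return [(s, a, sample_next_state(P, s, a, rng)) for s, a in pairs]

def navigation_samples(P, s0, choose_action, n_steps, rng):
    """Navigation setting: only the current state can be sampled, and the
    next state to sample from is imposed by the trajectory."""
    s, transitions = s0, []
    for _ in range(n_steps):
        a = choose_action(s)
        s_next = sample_next_state(P, s, a, rng)
        transitions.append((s, a, s_next))
        s = s_next  # the agent cannot jump to an arbitrary state
    return transitions

# Hypothetical 2-state, 2-action MDP: P[s][a][s'] (illustrative values only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]

rng = random.Random(0)
gen = generative_samples(P, [(1, 1), (0, 0), (1, 0)], rng)
traj = navigation_samples(P, 0, lambda s: 0, 50, rng)
```

In the generative case the sampler controls exactly which pairs are observed. In the navigation case the visit counts of each (state, action) pair depend on the dynamics and on the action rule, which is why an exploration rule and an ergodicity-type argument are needed to guarantee that every pair is sampled often enough.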


research
08/11/2022

Best Policy Identification in Linear MDPs

We investigate the problem of best policy identification in discounted l...
research
09/28/2020

Best Policy Identification in discounted MDPs: Problem-specific Sample Complexity

We investigate the problem of best-policy identification in discounted M...
research
11/30/2022

The Smoothed Complexity of Policy Iteration for Markov Decision Processes

We show subexponential lower bounds (i.e., 2^Ω (n^c)) on the smoothed co...
research
09/22/2019

Faster saddle-point optimization for solving large-scale Markov decision processes

We consider the problem of computing optimal policies in average-reward ...
research
02/06/2013

Correlated Action Effects in Decision Theoretic Regression

Much recent research in decision theoretic planning has adopted Markov d...
research
11/17/2022

Learning Mixtures of Markov Chains and MDPs

We present an algorithm for use in learning mixtures of both Markov chai...
research
02/02/2023

A general Markov decision process formalism for action-state entropy-regularized reward maximization

Previous work has separately addressed different forms of action, state ...
