Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP

12/01/2022
by Jinghan Wang, et al.

This work considers the sample complexity of obtaining an ε-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of O(H ε^{-3} ln(1/δ)) samples per state-action pair, where H := sp(h^*) is the span of the bias of any optimal policy, ε is the accuracy and δ is the failure probability. This bound improves on the best-known mixing-time-based approaches in [Jin Sidford 2021], which assume that the mixing time of every deterministic policy is bounded. The core of our analysis is a proper reduction bound from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interest since it allows DMDP algorithms to be applied to AMDPs in other settings. We complement our upper bound by proving a minimax lower bound of Ω(|𝒮| |𝒜| H ε^{-2} ln(1/δ)) total samples, showing that a linear dependence on H is necessary and that our upper bound matches the lower bound in all of the parameters (|𝒮|, |𝒜|, H, ln(1/δ)) up to logarithmic factors.
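To make the AMDP-to-DMDP reduction idea concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes the transition model is known exactly (no generative-model sampling), uses plain value iteration as the DMDP solver, and runs on a hypothetical two-state, two-action MDP invented for illustration. The function names `value_iteration` and `average_reward` are likewise assumptions. How close to 1 the discount factor must be, in terms of ε and H, is what the reduction bound quantifies and is not reproduced here.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Solve the discounted MDP and return a greedy deterministic policy.

    P: array of shape (S, A, S) with transition probabilities.
    r: array of shape (S, A) with rewards in [0, 1].
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        q = r + gamma * (P @ v)          # Bellman backup, shape (S, A)
        v_new = q.max(axis=1)
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    return (r + gamma * (P @ v)).argmax(axis=1)

def average_reward(P, r, policy):
    """Long-run average reward of a policy whose chain has a unique
    stationary distribution (an assumption of this sketch)."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]                # (S, S) chain induced by the policy
    r_pi = r[idx, policy]                # (S,) rewards under the policy
    # Solve mu^T P_pi = mu^T together with sum(mu) = 1 by least squares.
    lhs = np.vstack([P_pi.T - np.eye(S), np.ones((1, S))])
    rhs = np.zeros(S + 1)
    rhs[-1] = 1.0
    mu, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return float(mu @ r_pi)

# Hypothetical toy MDP: state 1 pays reward 1 for staying, state 0 pays 0.5
# for staying, and moving between states pays 0. Actions slip with prob. 0.1.
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # state 0: action 0 = stay, action 1 = move
    [[0.1, 0.9], [0.9, 0.1]],   # state 1: action 0 = stay, action 1 = move
])
r = np.array([
    [0.5, 0.0],
    [1.0, 0.0],
])

for gamma in (0.1, 0.99):
    policy = value_iteration(P, r, gamma)
    print(gamma, policy, average_reward(P, r, policy))
```

In this toy example the policy found with γ = 0.99 moves from state 0 to state 1 and attains the best average reward (0.9), whereas the policy found with γ = 0.1 myopically stays in state 0 and attains only 0.75. This illustrates why the reduction needs a discount factor sufficiently close to 1.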


