Improved Exploration in Factored Average-Reward MDPs

09/09/2020
by Mohammad Sadegh Talebi, et al.

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space 𝒳 and the state-space 𝒮 admit the respective factored forms 𝒳 = ⊗_{i=1}^{n} 𝒳_i and 𝒮 = ⊗_{i=1}^{m} 𝒮_i, and the transition and reward functions are factored over 𝒳 and 𝒮. Assuming a known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the sizes of the 𝒮_i and on the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically achieves substantially lower regret than existing algorithms.
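To make the factored transition structure concrete, here is a minimal Python sketch of how a factored transition probability can be evaluated as a product of per-factor conditionals, P(s' | x) = ∏_i P_i(s'_i | x[Z_i]), where Z_i is the scope (parent set) of the i-th state factor. The function name `factored_transition_prob`, the toy scopes, and all probability values are hypothetical and chosen only for illustration. This is not the DBN-UCRL algorithm, and it does not build any confidence sets.

```python
import itertools

def factored_transition_prob(next_state, x, factors, scopes):
    """Return P(next_state | x) as a product of per-factor conditionals.

    factors[i] maps the values of x restricted to scopes[i] to a
    probability vector over the values of the i-th state factor.
    """
    prob = 1.0
    for i, (table, scope) in enumerate(zip(factors, scopes)):
        parent_values = tuple(x[j] for j in scope)  # x restricted to scope Z_i
        prob *= table[parent_values][next_state[i]]
    return prob

# Hypothetical toy FMDP: two binary state factors and one binary action,
# so a state-action pair is x = (s_1, s_2, a).
scopes = [(0, 2), (1,)]  # factor 1 depends on (s_1, a); factor 2 on s_2 only
factors = [
    {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]},
    {(0,): [0.7, 0.3], (1,): [0.4, 0.6]},
]

x = (0, 1, 1)
p = factored_transition_prob((1, 0), x, factors, scopes)

# Summing over all joint next states should give 1 for a valid model.
total = sum(
    factored_transition_prob(s, x, factors, scopes)
    for s in itertools.product(range(2), repeat=2)
)
```

In this toy model, the two factors are described by 4 + 2 = 6 small conditional distributions, whereas an unfactored model would need one distribution over the 4 joint next states for each of the 8 state-action pairs. This kind of saving is what makes per-element confidence sets on the factors attractive.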


