Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

07/23/2020
by Chen-Yu Wei, et al.

We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal O(√T) regret and another computationally efficient variant with O(T^(3/4)) regret, where T is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with O(√T) regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with O(T^(2/3)) regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
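
To give a rough sense of how far apart the three regret rates quoted above are, the following minimal sketch evaluates T^(1/2), T^(2/3), and T^(3/4) for a given number of interactions T. It is an illustration only: constants, logarithmic factors, and problem-dependent terms are dropped, and the names `regret_rate`, `RATES`, and `compare` are hypothetical helpers, not anything from the paper.

```python
# Illustrative only: compares the growth of the regret rates named in the
# abstract, with all constants and logarithmic factors dropped (an assumption).

# Exponents of T for each rate mentioned in the abstract.
RATES = {
    "sqrt(T)": 1 / 2,   # optimal rate; also the rate of the second efficient algorithm
    "T^(2/3)": 2 / 3,   # earlier result attributed to Hao et al. (2020)
    "T^(3/4)": 3 / 4,   # efficient optimism-based variant
}


def regret_rate(T: int, exponent: float) -> float:
    """Return T**exponent, the bare growth rate of a regret bound."""
    return float(T) ** exponent


def compare(T: int) -> dict:
    """Map each rate name to its value at horizon T."""
    return {name: regret_rate(T, exp) for name, exp in RATES.items()}


if __name__ == "__main__":
    for T in (10**4, 10**6, 10**8):
        values = compare(T)
        # Dividing by T gives the average per-step regret, which shrinks
        # toward zero for every sublinear rate, fastest for sqrt(T).
        summary = ", ".join(
            f"{name}: {v:.1f} (per step {v / T:.5f})" for name, v in values.items()
        )
        print(f"T={T}: {summary}")
```

All three rates are sublinear, so the average regret per interaction vanishes as T grows; the gap between them widens with T, which is why moving from T^(2/3) or T^(3/4) to √T matters for long horizons.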


research
10/07/2021

A Model Selection Approach for Corruption Robust Reinforcement Learning

We develop a model selection approach to tackle reinforcement learning w...
research
01/31/2022

Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

We study regret minimization for infinite-horizon average-reward Markov ...
research
09/05/2023

Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes

In this paper, we consider an infinite horizon average reward Markov Dec...
research
02/25/2021

Improved Regret Bound and Experience Replay in Regularized Policy Iteration

In this work, we study algorithms for learning in infinite-horizon undis...
research
05/11/2022

Stochastic first-order methods for average-reward Markov decision processes

We study the problem of average-reward Markov decision processes (AMDPs)...
research
10/12/2019

Thompson Sampling in Non-Episodic Restless Bandits

Restless bandit problems assume time-varying reward distributions of the...
research
04/07/2023

Full Gradient Deep Reinforcement Learning for Average-Reward Criterion

We extend the provably convergent Full Gradient DQN algorithm for discou...
