Average reward reinforcement learning with unknown mixing times

05/23/2019
by Tom Zahavy, et al.

We derive and analyze learning algorithms for policy evaluation, apprenticeship learning, and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample complexity and running time.
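To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of average-reward policy evaluation on a toy Markov reward process: the gain of a fixed policy is estimated by averaging rewards along a single long trajectory. With a known upper bound on the mixing time one can pick the trajectory length in advance; without it, the estimate can only be refined as more samples arrive, which is the gap the paper addresses. All names and the 3-state chain below are illustrative assumptions.

```python
import numpy as np

# Hypothetical 3-state Markov reward process standing in for an MDP under a
# fixed policy: P is the transition matrix, r the per-state reward.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
r = np.array([1.0, 0.0, 2.0])

def true_gain(P, r):
    """Average reward rho = pi^T r, where pi is the stationary distribution of P."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])  # eigenvector for eigenvalue 1
    pi = pi / pi.sum()
    return pi @ r

def monte_carlo_gain(P, r, n_steps, rng, s0=0):
    """Estimate rho by averaging rewards along one trajectory of length n_steps.

    A mixing-time bound would tell us how large n_steps must be for a given
    accuracy; here we simply report the running average for increasing n_steps.
    """
    s, total = s0, 0.0
    for _ in range(n_steps):
        total += r[s]
        s = rng.choice(len(r), p=P[s])
    return total / n_steps

rng = np.random.default_rng(0)
print("true gain:", true_gain(P, r))
for n in [100, 1_000, 10_000]:
    print(f"estimate after {n:>6} steps:", monte_carlo_gain(P, r, n, rng))
```

The bias of such a trajectory average decays at a rate governed by how quickly the chain mixes, which is why classical sample-complexity statements are phrased in terms of a known mixing-time bound.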


