On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly-Communicating MDPs

09/30/2022
by Yi Wan, et al.

We show that two average-reward off-policy control algorithms, Differential Q Learning (Wan, Naik, & Sutton 2021a) and RVI Q Learning (Abounadi, Bertsekas, & Borkar 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs in which a learning algorithm with a single stream of experience can be guaranteed to obtain a policy achieving the optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true for weakly-communicating MDPs. To the best of our knowledge, our results are the first to show that average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that the average-reward options algorithms introduced by Wan, Naik, & Sutton (2021b) converge if the Semi-MDP induced by the options is weakly-communicating.
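
For readers unfamiliar with the first algorithm named above, the following is a minimal tabular sketch of the Differential Q Learning update as it is commonly stated: a temporal-difference error computed relative to a running reward-rate estimate, which is then used to update both the action value and the reward-rate estimate. Everything else here is an assumption made for illustration and does not come from the abstract: the function and parameter names (`differential_q_learning`, `alpha`, `eta`, `epsilon`), the constant step size, the epsilon-greedy behavior policy, and the hypothetical two-state MDP `toy_step`. It does not reproduce the step-size conditions used in the convergence analysis.

```python
import random


def differential_q_learning(step, n_states, n_actions, n_steps=50_000,
                            alpha=0.1, eta=1.0, epsilon=0.1, seed=0):
    """Tabular sketch; `step(s, a, rng)` must return (reward, next_state)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    r_bar = 0.0  # running estimate of the reward rate
    s = 0
    for _ in range(n_steps):
        # Epsilon-greedy behavior policy (an illustrative choice).
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: q[s][b])
        r, s_next = step(s, a, rng)
        # TD error measured relative to the reward-rate estimate.
        delta = r - r_bar + max(q[s_next]) - q[s][a]
        q[s][a] += alpha * delta
        # The same TD error drives the reward-rate update.
        r_bar += eta * alpha * delta
        s = s_next
    return q, r_bar


def toy_step(s, a, rng):
    """Hypothetical deterministic two-state MDP (not from the paper).

    Action 0 stays in the current state, action 1 switches state.
    Only staying in state 1 yields reward 1; everything else yields 0.
    """
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 1 and a == 0) else 0.0
    return r, s_next


if __name__ == "__main__":
    q, r_bar = differential_q_learning(toy_step, n_states=2, n_actions=2)
    print("estimated reward rate:", round(r_bar, 3))
    print("action values:", [[round(v, 3) for v in row] for row in q])
```

In this toy problem the best policy moves to state 1 and stays there, for a reward rate of 1, so the printed estimate should settle near that value even though the behavior policy keeps exploring. The toy MDP is communicating, so it illustrates the update only and not the weakly-communicating case that the paper addresses.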


research
10/26/2021

Average-Reward Learning and Planning with Options

We extend the options framework for temporal abstraction in reinforcemen...
research
01/31/2023

Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor

We introduce the Blackwell discount factor for Markov Decision Processes...
research
01/08/2021

Average-Reward Off-Policy Policy Evaluation with Function Approximation

We consider off-policy policy evaluation with function approximation (FA...
research
02/14/2012

A Geometric Traversal Algorithm for Reward-Uncertain MDPs

Markov decision processes (MDPs) are widely used in modeling decision ma...
research
11/12/2021

Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity

Reinforcement learning algorithms often require finiteness of state and ...
research
03/09/2022

ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling

This paper studies the problem of data collection for policy evaluation ...
research
07/11/2022

Cluster-Based Control of Transition-Independent MDPs

This work studies the ability of a third-party influencer to control the...