Improved Regret Bound and Experience Replay in Regularized Policy Iteration

02/25/2021
by Nevena Lazic, et al.

In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from O(T^3/4) to O(√(T)) under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability O(√(T)) regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.
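The efficient implementation described above replaces the per-policy Q-functions of exact Politex with a single Q-function trained on a replay buffer of past data, and acts via a softmax over the accumulated action-value estimates. The following is a minimal sketch of those two ingredients, assuming a finite action space; the names `ReplayBuffer` and `softmax_policy`, and the idea of scaling an averaged Q-estimate by the number of phases to stand in for the sum over past phases, are illustrative choices, not code from the paper.

```python
import random
import numpy as np

class ReplayBuffer:
    """FIFO buffer holding transitions collected under all past policies.

    Training a single Q-function on samples from this buffer approximates
    the average of the action-value functions of past policies.
    """
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.data = []

    def add(self, transition):
        # Evict the oldest transition once capacity is reached.
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))


def softmax_policy(q_avg, eta, num_phases):
    """Politex-style softmax policy over accumulated Q-estimates.

    q_avg: array of per-action estimates approximating the average of
    past Q-functions, so eta * num_phases * q_avg stands in for the
    sum of Q-values over past phases (an illustrative simplification).
    """
    logits = eta * num_phases * np.asarray(q_avg, dtype=float)
    logits -= logits.max()  # numerical stability before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()
```

For example, with two actions and a higher estimated value for the second, the policy places more probability on the second action while still exploring the first.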


Related research:

- Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation (07/23/2020)
- Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints (01/31/2022)
- Provably Efficient Adaptive Approximate Policy Iteration (02/08/2020)
- Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning (06/01/2022)
- Optimistic Planning by Regularized Dynamic Programming (02/27/2023)
- Regret Bounds for Information-Directed Reinforcement Learning (06/09/2022)
- Convergence Results For Q-Learning With Experience Replay (12/08/2021)
