Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback

05/13/2023
by   Tal Lancewicki, et al.

Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL), so theoretical guarantees for PO algorithms have become especially important to the RL community. In this paper, we study PO in adversarial MDPs with a challenge that arises in almost every real-world application: delayed bandit feedback. We give the first near-optimal regret bounds for PO in tabular MDPs, which may even surpass the state-of-the-art (obtained with less efficient methods). Our novel Delay-Adapted PO (DAPO) is easy to implement and to generalize, allowing us to extend the algorithm to: (i) infinite state spaces under the assumption of a linear Q-function, proving the first regret bounds for delayed feedback with function approximation; (ii) deep RL, demonstrating its effectiveness in experiments on MuJoCo domains.
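To make the setting concrete, below is a minimal, hypothetical sketch (Python/NumPy) of tabular policy optimization under delayed bandit feedback: the feedback for episode t only becomes available after some delay, and each policy update uses whatever feedback has arrived so far. The rollout stub, step size, delay distribution, and multiplicative-weights update are illustrative assumptions for exposition, not the paper's DAPO algorithm.

```python
import numpy as np

np.random.seed(0)
n_states, n_actions = 5, 3
eta = 0.1                                   # policy-optimization step size (assumed)

# Policy: per-state distribution over actions, initialized uniform.
policy = np.ones((n_states, n_actions)) / n_actions
pending = []                                # undelivered feedback: (arrival_episode, Q_estimate)

def fake_rollout(policy):
    """Stand-in for one episode: returns a noisy Q-estimate for each (s, a)."""
    visited = np.random.rand(n_states, n_actions) < policy   # which pairs were "seen"
    rewards = np.random.rand(n_states, n_actions)
    return np.where(visited, rewards, 0.0)

for t in range(100):
    q_hat = fake_rollout(policy)
    delay = np.random.randint(0, 5)         # delay of episode t's feedback (random in this demo)
    pending.append((t + delay, q_hat))

    # Deliver all feedback whose delay has elapsed, then apply a
    # multiplicative-weights (softmax-style) policy update per state.
    arrived = [q for (due, q) in pending if due <= t]
    pending = [(due, q) for (due, q) in pending if due > t]
    for q in arrived:
        policy = policy * np.exp(eta * q)
        policy = policy / policy.sum(axis=1, keepdims=True)
```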


