Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs

05/18/2022
by   Ian A. Kash, et al.

Reinforcement learning (RL) generalizes bandit problems with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, any slowly changing adversarial bandit algorithm that enjoys near-optimal regret in adversarial bandits can also achieve near-optimal (expected) regret in non-episodic discounted MDPs. The slowly changing property required by our generalization is mild (see, e.g., Even-Dar et al. 2009, Neu et al. 2010); we also show, for example, that the algorithm of Auer et al. (2002) is slowly changing and thus enjoys near-optimal regret in MDPs.
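For illustration only, here is a minimal sketch of an exponential-weights adversarial bandit algorithm, assuming the Auer et al. (2002) citation refers to Exp3; the function name, parameters (gamma, horizon, reward_fn), and the toy run are illustrative choices, not details taken from the paper. The point of the sketch is that each round rescales a single weight by a factor close to 1, so the sampling distribution drifts only slightly between rounds, which is the flavour of the "slowly changing" property the abstract relies on.

    import numpy as np

    def exp3(num_arms, horizon, reward_fn, gamma=0.05, seed=0):
        """Sketch of Exp3 (Auer et al. 2002) for rewards in [0, 1]."""
        rng = np.random.default_rng(seed)
        weights = np.ones(num_arms)
        for t in range(horizon):
            # Mix exponential weights with uniform exploration.
            probs = (1 - gamma) * weights / weights.sum() + gamma / num_arms
            arm = rng.choice(num_arms, p=probs)
            reward = reward_fn(t, arm)            # observed bandit feedback in [0, 1]
            estimate = reward / probs[arm]        # importance-weighted reward estimate
            # Only the played arm's weight changes, by a factor close to 1,
            # so the distribution probs moves slowly from round to round.
            weights[arm] *= np.exp(gamma * estimate / num_arms)
            weights /= weights.max()              # rescale for numerical stability
        return (1 - gamma) * weights / weights.sum() + gamma / num_arms

    if __name__ == "__main__":
        # Toy run (hypothetical data): 5 arms, arm 2 has the highest mean reward.
        rng = np.random.default_rng(1)
        dist = exp3(5, 20_000, lambda t, a: rng.random() * (0.9 if a == 2 else 0.5))
        print(dist)  # probability mass should concentrate on arm 2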

