Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

06/10/2020
by Tiancheng Jin, et al.

This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a “best-of-both-worlds” guarantee: it achieves O(log T) regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with Õ(√T) regret even when the losses are adversarial, where T is the number of episodes. More generally, it achieves Õ(√C) regret in an intermediate setting where the losses are corrupted by a total amount of C. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b) for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.
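
To make the Follow-the-Regularized-Leader idea concrete, here is a minimal sketch for the multi-armed bandit special case only, not the paper's algorithm. It assumes a plain 1/2-Tsallis-entropy regularizer, an importance-weighted loss estimator, and a learning rate of 1/√t picked for illustration; the paper instead works with a hybrid regularizer over the MDP and a non-diagonal Hessian, which this sketch does not attempt. The function names and the `loss_fn` callback are hypothetical.

```python
import math
import random


def tsallis_ftrl_distribution(cum_losses, eta, iters=60):
    """Solve argmin_p <p, L> - (2/eta) * sum_i sqrt(p_i) over the simplex.

    Stationarity gives p_i = 1 / (eta * (L_i + lam))**2, where the
    multiplier lam is chosen so the probabilities sum to 1. The sum is
    decreasing in lam, so bisection finds it.
    """
    k = len(cum_losses)
    base = min(cum_losses)
    lo = -base + 1.0 / eta            # here the best arm alone has mass 1
    hi = -base + math.sqrt(k) / eta   # here every arm has mass <= 1/k
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(1.0 / (eta * (l + mid)) ** 2 for l in cum_losses)
        if total > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    probs = [1.0 / (eta * (l + lam)) ** 2 for l in cum_losses]
    norm = sum(probs)                 # absorb the residual bisection error
    return [p / norm for p in probs]


def run_bandit(loss_fn, k, horizon, seed=0):
    """Play `horizon` rounds; loss_fn(t, arm, rng) returns a loss in [0, 1]."""
    rng = random.Random(seed)
    cum_est = [0.0] * k               # cumulative importance-weighted losses
    total_loss = 0.0
    probs = [1.0 / k] * k
    for t in range(1, horizon + 1):
        eta = 1.0 / math.sqrt(t)      # illustrative schedule (assumption)
        probs = tsallis_ftrl_distribution(cum_est, eta)
        arm = rng.choices(range(k), weights=probs)[0]
        loss = loss_fn(t, arm, rng)
        total_loss += loss
        cum_est[arm] += loss / probs[arm]   # unbiased estimate of the loss
    return total_loss, probs
```

For example, `run_bandit(lambda t, arm, rng: 0.0 if arm == 0 else 1.0, k=3, horizon=2000)` returns the total loss incurred and the final sampling distribution, which should concentrate on arm 0 in this easy stochastic instance.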


research
06/08/2021

The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition

We consider the best-of-both-worlds problem for learning an episodic Mar...
research
02/18/2023

Best of Both Worlds Policy Optimization

Policy optimization methods are popular reinforcement learning algorithm...
research
07/20/2021

Best-of-All-Worlds Bounds for Online Learning with Feedback Graphs

We study the online learning with feedback graphs framework introduced b...
research
01/03/2013

Follow the Leader If You Can, Hedge If You Must

Follow-the-Leader (FTL) is an intuitive sequential prediction strategy t...
research
05/27/2023

No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions

Existing online learning algorithms for adversarial Markov Decision Proc...
research
02/20/2023

A Blackbox Approach to Best of Both Worlds in Bandits and Beyond

Best-of-both-worlds algorithms for online learning which achieve near-op...
research
10/14/2019

An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays

We propose a new algorithm for adversarial multi-armed bandits with unre...
