Best of Both Worlds Policy Optimization

02/18/2023
by   Christoph Dann, et al.

Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built a theoretical foundation for them by proving √T regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus, and the learning rates, one can achieve a more favorable polylog(T) regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent polylog(T) regret bound has been shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. We then show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.
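
To make the role of the regularizer concrete, the following is a minimal sketch of a regularized policy update at a single state, not the paper's algorithm. It assumes a generic follow-the-regularized-leader step that minimizes ⟨p, L⟩ + ψ(p)/η over the action simplex, where L is a vector of cumulative loss estimates. The function names, the fixed learning rate `eta`, and the scaling ψ(p) = −2 Σ √p_i for the 1/2-Tsallis entropy are illustrative assumptions; the exploration bonus, the loss estimators, and the learning-rate schedule mentioned in the abstract are omitted.

```python
import math


def shannon_policy(cum_loss, eta):
    """FTRL step with the negative Shannon entropy regularizer.

    The minimizer of <p, L> + (1/eta) * sum_i p_i ln p_i over the simplex
    is the softmax of -eta * L (exponential weights).
    """
    m = min(cum_loss)  # shift by the minimum for numerical stability
    w = [math.exp(-eta * (l - m)) for l in cum_loss]
    total = sum(w)
    return [x / total for x in w]


def tsallis_policy(cum_loss, eta, iters=100):
    """FTRL step with an (assumed) 1/2-Tsallis regularizer psi(p) = -2 sum_i sqrt(p_i).

    The first-order conditions give p_i = 1 / (eta * (L_i - x))^2, where the
    normalizer x < min_i L_i is chosen so that the p_i sum to one.
    Here x is found by bisection.
    """
    k = len(cum_loss)
    m = min(cum_loss)
    lo = m - math.sqrt(k) / eta  # total mass is at most 1 at this x
    hi = m - 1.0 / eta           # total mass is at least 1 at this x
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mass = sum(1.0 / (eta * (l - mid)) ** 2 for l in cum_loss)
        if mass > 1.0:
            hi = mid  # x too large, probabilities too big
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    p = [1.0 / (eta * (l - x)) ** 2 for l in cum_loss]
    total = sum(p)  # renormalize to absorb bisection error
    return [v / total for v in p]


if __name__ == "__main__":
    losses = [1.0, 2.0, 4.0]  # hypothetical cumulative loss estimates
    print(shannon_policy(losses, eta=0.5))
    print(tsallis_policy(losses, eta=0.5))
```

Both updates put more probability on actions with smaller cumulative loss. The Tsallis update decays polynomially rather than exponentially in the loss gap, so it keeps more mass on apparently suboptimal actions.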


