Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

07/18/2021
by Haipeng Luo, et al.

Policy optimization is a widely used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Process (MDP) that bypass the challenge of global exploration. To eliminate the need for such assumptions, in this work we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state of the art. Specifically, in the tabular case, we obtain 𝒪(√T) regret, where T is the number of episodes, improving the 𝒪(T^{2/3}) regret bound of Shani et al. (2020). When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain 𝒪(T^{2/3}) regret with the help of a simulator, matching the result of Neu and Olkhovskaya (2020) while, importantly, removing the need for the exploratory policy that their algorithm requires. When a simulator is unavailable, we further consider a linear MDP setting and obtain 𝒪(T^{14/15}) regret, which is the first result for linear MDPs with adversarial losses and bandit feedback.
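To give a feel for how far apart the three regret rates quoted above are, the following minimal sketch evaluates them for a few episode counts. It is only an illustration: constants, logarithmic factors, and all problem-dependent terms (horizon, number of states and actions, feature dimension) are ignored, and the function name `regret_rates` and its dictionary keys are hypothetical, not taken from the paper.

```python
import math


def regret_rates(T):
    """Return the three rates from the abstract for T episodes.

    Constants, log factors, and problem-dependent terms are dropped,
    so only the growth in T is meaningful.
    """
    return {
        "tabular": math.sqrt(T),              # sqrt(T), tabular case
        "linear_q_simulator": T ** (2 / 3),   # T^(2/3), linear-Q with a simulator
        "linear_mdp": T ** (14 / 15),         # T^(14/15), linear MDP, no simulator
    }


if __name__ == "__main__":
    for T in (10**3, 10**6, 10**9):
        rates = regret_rates(T)
        # Average regret per episode is rate / T; every rate here is
        # sublinear in T, so this ratio shrinks as T grows.
        per_episode = {name: rate / T for name, rate in rates.items()}
        print(T, per_episode)
```

All three rates are sublinear, so the average regret per episode vanishes as T grows, but the T^(14/15) rate does so far more slowly than the other two.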

research
01/30/2023

Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

We study reinforcement learning with linear function approximation and a...
research
05/13/2023

Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback

Policy Optimization (PO) is one of the most popular methods in Reinforce...
research
02/14/2023

Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

Learning Markov decision processes (MDP) in an adversarial environment h...
research
02/09/2021

RL for Latent MDPs: Regret Guarantees and a Lower Bound

In this work, we consider the regret minimization problem for reinforcem...
research
02/18/2023

Best of Both Worlds Policy Optimization

Policy optimization methods are popular reinforcement learning algorithm...
research
12/02/2021

Differentially Private Exploration in Reinforcement Learning with Linear Representation

This paper studies privacy-preserving exploration in Markov Decision Pro...
research
11/01/2021

Intervention Efficient Algorithm for Two-Stage Causal MDPs

We study Markov Decision Processes (MDP) wherein states correspond to ca...