Provably Efficient Exploration in Policy Optimization

12/12/2019
by Qi Cai, et al.

While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the setting of episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves Õ(√(d³H³T)) regret. Here d is the feature dimension, H is the episode horizon, and T is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
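The two ingredients the abstract describes, an optimism bonus on top of a linear value estimate and a policy update along the (optimistic) policy gradient direction, can be sketched as follows. This is a hedged illustration only, not the paper's pseudocode: the function names, the bonus coefficient `beta`, the step size `alpha`, and the toy features are all assumptions made for the example. The bonus has the standard elliptical-confidence form β√(φᵀΛ⁻¹φ) used in linear MDPs, and the policy update is a mirror-descent (multiplicative-weights) step, π ∝ π_old · exp(α·Q).

```python
import numpy as np

def optimistic_q(phi_sa, w, Lambda_inv, beta):
    """Least-squares Q estimate plus a UCB-style elliptical bonus.

    phi_sa     : feature vector phi(s, a) of dimension d
    w          : fitted weight vector of the linear Q estimate
    Lambda_inv : inverse of the regularized Gram matrix of observed features
    beta       : bonus scale (illustrative constant, not the paper's value)
    """
    q_hat = phi_sa @ w
    bonus = beta * np.sqrt(phi_sa @ Lambda_inv @ phi_sa)
    return q_hat + bonus

def mirror_descent_update(pi_old, q_values, alpha):
    """Soft policy-improvement step: pi proportional to pi_old * exp(alpha * Q)."""
    logits = np.log(pi_old) + alpha * q_values
    logits -= logits.max()          # shift for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

# Toy usage: d = 3 features, 2 actions at one state (all values illustrative).
d = 3
w = np.ones(d)                      # pretend least-squares fit
Lambda_inv = np.eye(d)              # pretend no data yet: identity covariance
phis = [np.array([1.0, 0.0, 0.0]),  # features of action 0
        np.array([0.0, 0.5, 0.0])]  # features of action 1
q = np.array([optimistic_q(p, w, Lambda_inv, beta=0.1) for p in phis])
pi = mirror_descent_update(np.array([0.5, 0.5]), q, alpha=1.0)
```

The action with the larger optimistic value receives more probability mass, while the multiplicative form keeps the update close to the previous policy, which is the mechanism behind the mirror-descent analysis of such algorithms.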


Related research

- 03/01/2020, Provably Efficient Safe Exploration via Primal-Dual Policy Optimization: We study the Safe Reinforcement Learning (SRL) problem using the Constra...
- 11/01/2019, Frequentist Regret Bounds for Randomized Least-Squares Value Iteration: We consider the exploration-exploitation dilemma in finite-horizon reinf...
- 05/24/2021, Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence: Policy optimization, which learns the policy of interest by maximizing t...
- 06/06/2016, Learning to Optimize: Algorithm design is a laborious process and often requires many iteratio...
- 05/19/2020, Riemannian Proximal Policy Optimization: In this paper, we propose a general Riemannian proximal optimization alg...
- 06/15/2023, Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling: Policy optimization methods are powerful algorithms in Reinforcement Lea...
- 02/19/2021, Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning: In offline reinforcement learning (RL) an optimal policy is learnt solel...
