On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

02/17/2021
by Junyu Zhang et al.

Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE. Yet the best-known sample complexity result for such methods to find an ϵ-optimal policy is 𝒪(ϵ^-3), which is suboptimal. In this paper, we study the fundamental convergence properties and sample efficiency of first-order policy optimization methods. We focus on a generalized variant of the policy gradient method that maximizes not only a cumulative sum of rewards but also a general utility function over a policy's long-term visitation distribution. By exploiting the problem's hidden convex nature and leveraging techniques from composition optimization, we propose a Stochastic Incremental Variance-Reduced Policy Gradient (SIVR-PG) approach that improves a sequence of policies, provably converges to the globally optimal solution, and finds an ϵ-optimal policy using 𝒪̃(ϵ^-2) samples.
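The core idea behind variance-reduced policy gradients can be illustrated with a SARAH-style recursive estimator: a large-batch gradient is refreshed periodically, and between refreshes the estimator is updated with the difference of gradients at consecutive iterates, using importance weights to correct for the distribution shift between policies. The sketch below is an illustrative toy (a softmax policy on a two-armed bandit), not the paper's exact SIVR-PG algorithm; the function name, step sizes, and batch sizes are assumptions for demonstration.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # Score function for a softmax policy: ∇_θ log π_θ(a) = e_a - π_θ
    p = softmax(theta)
    g = -p
    g[a] += 1.0
    return g

def vr_pg_sketch(rewards, iters=300, epoch=10, batch=64, lr=0.5, seed=0):
    """Toy recursive variance-reduced policy gradient on a bandit.

    Illustrative sketch only; hyperparameters are assumptions, not the
    paper's SIVR-PG settings.
    """
    rng = np.random.default_rng(seed)
    n = len(rewards)
    theta = np.zeros(n)
    theta_prev = theta.copy()
    v = np.zeros(n)
    for t in range(iters):
        p = softmax(theta)
        a = rng.choice(n, size=batch, p=p)
        r = rewards[a]
        g_now = np.mean(
            [r[i] * grad_log_pi(theta, a[i]) for i in range(batch)], axis=0)
        if t % epoch == 0:
            # Periodic refresh: use the fresh batch gradient directly.
            v = g_now
        else:
            # Importance weights correct for samples drawn under the
            # current policy when evaluating the previous policy's gradient.
            p_prev = softmax(theta_prev)
            w = p_prev[a] / p[a]
            g_prev = np.mean(
                [w[i] * r[i] * grad_log_pi(theta_prev, a[i])
                 for i in range(batch)], axis=0)
            v = v + g_now - g_prev  # recursive (SARAH-style) update
        theta_prev = theta.copy()
        theta = theta + lr * v  # gradient ascent on expected reward
    return softmax(theta)
```

Running `vr_pg_sketch(np.array([1.0, 0.0]))` concentrates the policy on the higher-reward arm; the recursive correction term keeps the per-step batch small while the periodic refresh controls the estimator's drift, which is the mechanism behind the improved sample complexity.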


research
09/18/2019

Sample Efficient Policy Gradient Methods with Recursive Variance Reduction

Improving the sample efficiency in reinforcement learning has been a lon...
research
09/14/2020

Variance-Reduced Off-Policy Memory-Efficient Policy Search

Off-policy policy optimization is a challenging problem in reinforcement...
research
02/11/2021

Robust Policy Gradient against Strong Data Corruption

We study the problem of robust reinforcement learning under adversarial ...
research
07/04/2020

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

In recent years, reinforcement learning (RL) systems with general goals ...
research
11/15/2022

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

In this paper, we revisit and improve the convergence of policy gradient...
research
06/15/2023

Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling

Policy optimization methods are powerful algorithms in Reinforcement Lea...
research
06/29/2021

Curious Explorer: a provable exploration strategy in Policy Learning

Having access to an exploring restart distribution (the so-called wide c...
