Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

07/13/2020
by Shicong Cen, et al.

Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization, an algorithmic scheme that helps encourage exploration, and is closely related to soft policy iteration and trust region policy optimization. Despite their empirical success, the theoretical underpinnings of NPG methods remain severely limited, even in the tabular setting. This paper develops non-asymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly, or even quadratically once it enters a local region around the optimal policy, when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation, and is able to find an ϵ-optimal policy for the original MDP when applied to a slightly perturbed MDP. Our convergence results improve upon those established for unregularized NPG methods (arXiv:1908.00261), and shed light upon the role of entropy regularization in accelerating convergence.
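As a hedged illustration of the setting described above (a tabular discounted MDP, softmax policies, exact policy evaluation), the sketch below runs entropy-regularized NPG on a small randomly generated MDP and tracks the sup-norm error of the regularized Q-function. It assumes the multiplicative-weights form of the update, π^(t+1)(a|s) ∝ π^(t)(a|s)^(1-ητ/(1-γ)) · exp(η Q_τ^(t)(s,a)/(1-γ)) with 0 < η ≤ (1-γ)/τ, where Q_τ is the entropy-regularized Q-function. The reference Q*_τ is computed by soft value iteration. The MDP size, random seed, τ, γ, η, and the helper names (`evaluate`, `soft_optimal_q`, `entropy_npg`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def evaluate(log_pi, P, r, gamma, tau):
    """Exact evaluation of a softmax policy in the entropy-regularized MDP.

    V(s)   = E[sum_t gamma^t (r(s_t, a_t) - tau * log pi(a_t | s_t))]
    Q(s,a) = r(s, a) + gamma * sum_s' P(s' | s, a) V(s')
    """
    S = r.shape[0]
    pi = np.exp(log_pi)
    # Expected reward plus entropy bonus (-tau * log pi) in each state.
    r_pi = np.sum(pi * (r - tau * log_pi), axis=1)
    # State-to-state transition matrix induced by pi: P_pi[s, s'].
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)  # P has shape (S, A, S), so P @ V has shape (S, A)
    return V, Q


def soft_optimal_q(P, r, gamma, tau, iters=2000):
    """Reference Q*_tau via soft value iteration (a gamma-contraction)."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        V = tau * logsumexp(Q / tau, axis=1)  # soft maximum over actions
        Q = r + gamma * (P @ V)
    return Q


def entropy_npg(P, r, gamma, tau, eta, num_iters):
    """Entropy-regularized NPG in log space; assumes 0 < eta <= (1-gamma)/tau."""
    S, A = r.shape
    log_pi = np.full((S, A), -np.log(A))  # start from the uniform policy
    history = []
    for _ in range(num_iters):
        _, Q = evaluate(log_pi, P, r, gamma, tau)
        history.append(Q)
        # pi_new(a|s) proportional to pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta*Q/(1-gamma))
        log_pi = (1 - eta * tau / (1 - gamma)) * log_pi + (eta / (1 - gamma)) * Q
        log_pi = log_pi - logsumexp(log_pi, axis=1)[:, None]  # renormalize per state
    return log_pi, history


# Hypothetical random MDP: 5 states, 3 actions, rewards in [0, 1].
rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 3, 0.9, 0.1
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

Q_star = soft_optimal_q(P, r, gamma, tau)
eta = 0.5 * (1 - gamma) / tau  # half of the largest allowed step size
log_pi, history = entropy_npg(P, r, gamma, tau, eta, num_iters=500)
errs = [np.max(np.abs(Q_star - Q)) for Q in history]
for t in (0, 10, 50, len(errs) - 1):
    print(f"iteration {t:3d}: ||Q* - Q^(t)||_inf = {errs[t]:.3e}")
```

With η = (1-γ)/τ the exponent on the previous policy becomes zero and the update reduces to soft policy iteration, π^(t+1) ∝ exp(Q_τ^(t)/τ); the sketch uses half that step size so that the old policy still enters the update. Working with log-probabilities avoids underflow when τ is small.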
