Softmax Policy Gradient Methods Can Take Exponential Time to Converge

02/22/2021
by Gen Li, et al.

The softmax policy gradient (PG) method, which performs gradient ascent under the softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For γ-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been made towards establishing the global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality |𝒮| of the state space and the effective horizon 1/(1-γ), both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, even assuming access to exact gradient computation. Specifically, we demonstrate that softmax PG methods can take time exponential in |𝒮| and 1/(1-γ) to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration. This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization to accelerate PG methods.
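For concreteness, the setup under study parameterizes the policy as π_θ(a|s) ∝ exp(θ(s,a)) and performs gradient ascent θ ← θ + η ∇_θ V^{π_θ}(ρ) on the discounted value of an initial state distribution ρ, where the exact gradient has entries (1/(1-γ)) d_ρ^{π_θ}(s) π_θ(a|s) A^{π_θ}(s,a). The sketch below implements this update with exact policy evaluation on a randomly generated three-state, three-action toy MDP; the toy MDP and all names (e.g., exact_gradient) are illustrative assumptions, not the paper's hard instance, which is a specific carefully constructed MDP.

```python
import numpy as np

# Toy setup (illustrative, NOT the paper's construction): a tabular MDP with
# transition tensor P[s, a, s'], reward r[s, a], and initial distribution rho.
rng = np.random.default_rng(0)
nS, nA, gamma = 3, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over s'
r = rng.uniform(size=(nS, nA))
rho = np.ones(nS) / nS

def softmax_policy(theta):
    # pi(a|s) = exp(theta[s, a]) / sum_a' exp(theta[s, a'])
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def exact_gradient(theta):
    pi = softmax_policy(theta)
    # Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi, Q = r + gamma * P V
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)
    A = Q - V[:, None]  # advantage function
    # Discounted state visitation: d = (1-gamma) * (I - gamma * P_pi^T)^{-1} rho
    d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho) * (1 - gamma)
    # Softmax PG: dV(rho)/dtheta[s, a] = d(s) * pi(a|s) * A(s, a) / (1-gamma)
    return (d[:, None] * pi * A) / (1 - gamma), V @ rho

theta = np.zeros((nS, nA))  # uniform ("benign") initialization
eta = 0.1                   # constant step size
for t in range(5000):
    grad, value = exact_gradient(theta)
    theta += eta * grad
print("value of rho under the learned policy:", value)
```

On benign random instances like this one the iterates typically converge quickly; the paper's point is that there exist MDPs on which this very same update takes time exponential in |𝒮| and 1/(1-γ).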


Related research

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization (07/13/2020)
Natural policy gradient (NPG) methods are among the most widely used pol...

On the Global Convergence Rates of Softmax Policy Gradient Methods (05/13/2020)
We make three contributions toward better understanding policy gradient ...

Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime (10/22/2020)
We study the problem of policy optimization for infinite-horizon discoun...

Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games (06/15/2022)
We study the performance of policy gradient methods for the subclass of ...

Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime (01/18/2022)
We study the global convergence of policy gradient for infinite-horizon,...

An Alternate Policy Gradient Estimator for Softmax Policies (12/22/2021)
Policy gradient (PG) estimators for softmax policies are ineffective wit...

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control (06/15/2021)
Reinforcement learning is a framework for interactive decision-making wi...
