The Role of Baselines in Policy Gradient Optimization

01/16/2023
by Jincheng Mei, et al.

We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the state-value baseline allows on-policy stochastic natural policy gradient (NPG) to converge to a globally optimal policy at an O(1/t) rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform Łojasiewicz (NŁ) inequality, and with probability 1 the state-value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient methods: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction cannot explain their utility in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance. That is, we demonstrate that a finite variance is not necessary for almost sure convergence of stochastic NPG, whereas controlling the aggressiveness of the updates is both necessary and sufficient. Additional experimental results verify these theoretical findings.
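To make the setting concrete, the following is a minimal illustrative sketch (not the paper's exact construction) of on-policy stochastic softmax NPG with a value baseline on a single-state bandit. The reward vector `true_rewards`, the step size `eta`, and the use of the exact expected reward as an oracle baseline are assumptions made purely for brevity of the example.

```python
import numpy as np

# Illustrative sketch, assuming a single-state softmax bandit: on-policy
# stochastic natural policy gradient with the current policy's expected
# reward used as a value baseline. All constants here are hypothetical.

rng = np.random.default_rng(0)

true_rewards = np.array([1.0, 0.8, 0.2])   # unknown to the learner
n_actions = len(true_rewards)
theta = np.zeros(n_actions)                # softmax logits
eta = 0.1                                  # step size
n_steps = 5000

for t in range(n_steps):
    # Softmax policy over actions.
    logits = theta - theta.max()
    pi = np.exp(logits) / np.exp(logits).sum()

    # Sample an action on-policy and observe a noisy reward.
    a = rng.choice(n_actions, p=pi)
    reward = true_rewards[a] + 0.1 * rng.standard_normal()

    # Value baseline: expected reward under the current policy. An oracle
    # value is used here only to keep the sketch short; in practice it
    # would be estimated from data.
    baseline = pi @ true_rewards

    # Importance-sampled reward estimate, nonzero only for the sampled
    # action; the baseline is subtracted before the 1/pi[a] correction,
    # which is what tempers how aggressive each update can be.
    r_hat = np.zeros(n_actions)
    r_hat[a] = (reward - baseline) / pi[a]

    # Stochastic NPG step: for a softmax bandit the natural gradient of the
    # expected reward is the reward vector itself, so we step along r_hat.
    theta += eta * r_hat

print("final policy:", np.round(pi, 3))    # should concentrate on action 0
```

Note that the per-step estimate `r_hat` still has unbounded variance as `pi[a]` shrinks; the sketch is meant only to show where the baseline enters the update, which is the mechanism the paper identifies as controlling update aggressiveness.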


