Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration

06/04/2022
by Dustin Morrill, et al.

Neural replicator dynamics (NeuRD) is an alternative to the foundational softmax policy gradient (SPG) algorithm motivated by online learning and evolutionary game theory. The NeuRD expected update is designed to be nearly identical to that of SPG; however, we show that the Monte Carlo updates differ in a substantial way: the importance correction accounting for a sampled action is nullified in the SPG update, but not in the NeuRD update. Naturally, this causes the NeuRD update to have higher variance than its SPG counterpart. Building on implicit exploration algorithms in the adversarial bandit setting, we introduce capped implicit exploration (CIX) estimates that allow us to construct NeuRD-CIX, which interpolates between this aspect of NeuRD and SPG. We show how CIX estimates can be used in a black-box reduction to construct bandit algorithms with regret bounds that hold with high probability, and we describe the benefits this entails for NeuRD-CIX in sequential decision-making settings. Our analysis reveals a bias–variance tradeoff between SPG and NeuRD, and shows how theory predicts that NeuRD-CIX will perform well more consistently than NeuRD while retaining NeuRD's advantages over SPG in non-stationary environments.
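The difference between the two Monte Carlo updates can be made concrete with a small sketch. It assumes a tabular softmax policy over a single-state bandit, one sampled action per update, and the standard importance-weighted value estimate (reward divided by the probability of the sampled action, zero elsewhere). The function names and the example logits are illustrative assumptions, not taken from the paper, and the CIX estimate itself is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weighted_values(policy, action, reward):
    """Value estimate: reward / pi(action) at the sampled action, 0 elsewhere."""
    return [reward / policy[action] if b == action else 0.0
            for b in range(len(policy))]


def spg_update(policy, action, reward):
    """Single-sample SPG logit update: pi(b) * (q_hat(b) - v_hat).

    At the sampled action, pi(b) cancels the 1 / pi(b) inside q_hat,
    so the importance correction disappears from the update.
    """
    q_hat = importance_weighted_values(policy, action, reward)
    v_hat = sum(p * q for p, q in zip(policy, q_hat))  # equals the reward
    return [p * (q - v_hat) for p, q in zip(policy, q_hat)]


def neurd_update(policy, action, reward):
    """Single-sample NeuRD logit update: q_hat(b) - v_hat.

    No pi(b) factor multiplies q_hat, so 1 / pi(action) survives.
    """
    q_hat = importance_weighted_values(policy, action, reward)
    v_hat = sum(p * q for p, q in zip(policy, q_hat))
    return [q - v_hat for q in q_hat]


# Hypothetical policy in which action 2 is rarely sampled.
policy = softmax([0.0, 0.0, -4.0])
rare = 2
spg = spg_update(policy, rare, reward=1.0)
neurd = neurd_update(policy, rare, reward=1.0)
print("pi(rare)      =", policy[rare])
print("SPG update    =", spg[rare])
print("NeuRD update  =", neurd[rare])
```

Under these assumptions, the SPG update at the sampled action reduces to `reward * (1 - pi(action))`, which is bounded by the reward, while the NeuRD update is `reward * (1 / pi(action) - 1)`, which grows without bound as the sampled action becomes less likely. That gap is the variance difference the abstract describes.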


