The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

01/10/2013
by   Lex Weaver, et al.
0

There exist a number of reinforcement learning algorithms which learnby climbing the gradient of expected reward. Their long-runconvergence has been proved, even in partially observableenvironments with non-deterministic actions, and without the need fora system model. However, the variance of the gradient estimator hasbeen found to be a significant practical problem. Recent approacheshave discounted future rewards, introducing a bias-variance trade-offinto the gradient estimate. We incorporate a reward baseline into thelearning system, and show that it affects variance without introducingfurther bias. In particular, as we approach the zero-bias,high-variance parameterization, the optimal (or variance minimizing)constant reward baseline is equal to the long-term average expectedreward. Modified policy-gradient algorithms are presented, and anumber of experiments demonstrate their improvement over previous work.

READ FULL TEXT

page 1

page 2

page 5

page 7

page 8

research
02/27/2018

The Mirage of Action-Dependent Baselines in Reinforcement Learning

Policy gradient methods are a widely used class of model-free reinforcem...
research
01/03/2017

A K-fold Method for Baseline Estimation in Policy Gradient Algorithms

The high variance issue in unbiased policy-gradient methods such as VPG ...
research
11/15/2018

Reward-estimation variance elimination in sequential decision processes

Policy gradient methods are very attractive in reinforcement learning du...
research
07/06/2018

Variance Reduction for Reinforcement Learning in Input-Driven Environments

We consider reinforcement learning in input-driven environments, where a...
research
05/28/2021

A nearly Blackwell-optimal policy gradient method

For continuing environments, reinforcement learning methods commonly max...
research
02/16/2023

Analytically Tractable Models for Decision Making under Present Bias

Time-inconsistency is a characteristic of human behavior in which people...
research
09/26/2022

Learning GFlowNets from partial episodes for improved convergence and stability

Generative flow networks (GFlowNets) are a family of algorithms for trai...

Please sign up or login with your details

Forgot password? Click here to reset