Reward-estimation variance elimination in sequential decision processes

11/15/2018
by   Sergey Pankov, et al.

Policy gradient methods are attractive in reinforcement learning due to their model-free nature and convergence guarantees. These methods, however, suffer from high variance in gradient estimation, resulting in poor sample efficiency. To mitigate this issue, a number of variance-reduction approaches have been proposed. Unfortunately, in challenging problems with delayed rewards, these approaches either bring relatively modest improvement or reduce variance at the expense of introducing bias and undermining convergence. Unbiased gradient-estimation methods, in general, only partially reduce variance: they do not eliminate it completely even in the limit of exact knowledge of the value functions and problem dynamics, as one might have wished. In this work we propose an unbiased method that does completely eliminate variance under certain commonly encountered conditions. Of practical interest is the limit of deterministic dynamics and small policy stochasticity. In the case of a quadratic value function, as in linear quadratic Gaussian models, the policy randomness need not be small. We use such a model to analyze the performance of the proposed variance-elimination approach and compare it with standard variance-reduction methods. The core idea behind the approach is to use control variates at all future times down the trajectory. We present both model-based and model-free formulations.
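As a minimal illustration of the control-variate (baseline) idea that the abstract builds on, the sketch below compares a plain score-function (REINFORCE) gradient estimate with a baseline-subtracted one on a hypothetical one-step Gaussian-policy bandit. This is a generic baseline example constructed for illustration, not the paper's proposed method, which extends control variates to all future times along the trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem (assumed for illustration): action a ~ N(theta, sigma^2),
# reward R(a) = -(a - 2)^2.  Score function: d/dtheta log pi(a) = (a - theta) / sigma^2.
theta, sigma, n = 0.0, 1.0, 100_000
actions = rng.normal(theta, sigma, size=n)
rewards = -(actions - 2.0) ** 2
scores = (actions - theta) / sigma**2

# Plain REINFORCE per-sample gradient estimates.
g_plain = rewards * scores

# Control variate: subtract a constant baseline (the mean reward).  Because
# E[baseline * score] = 0, the estimator stays unbiased, but its variance drops.
baseline = rewards.mean()
g_base = (rewards - baseline) * scores

print("means:", g_plain.mean(), g_base.mean())   # both close to the true gradient
print("vars: ", g_plain.var(), g_base.var())     # baseline version has lower variance
```

For this toy objective the true gradient at theta = 0 is 4, so both estimator means converge to the same value, while the baseline-subtracted estimator exhibits markedly smaller variance. A constant baseline is the crudest control variate; value-function baselines, and the trajectory-wide control variates the abstract describes, reduce variance further.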


