Efficient iterative policy optimization

12/28/2016
by Nicolas Le Roux, et al.

We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required to achieve good performance. We also extend existing methods to negative rewards, enabling the use of control variates.
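To make the idea concrete, here is a minimal sketch of iterative surrogate maximization on a toy bandit: between batches of samples, a reward-weighted log-likelihood surrogate (concave in the policy's log-probabilities) is maximized with many cheap gradient steps, and a scalar baseline acts as a control variate so that shifted or negative rewards can still be used. Everything in it (the softmax policy, the batch size, the step sizes, and the plain gradient-ascent inner loop) is an assumption chosen for illustration, not the paper's algorithm.

```python
# Hypothetical sketch, not the authors' code: iterative maximization of a
# reward-weighted surrogate with a baseline (control variate) on a bandit.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5
true_means = rng.normal(size=n_actions)   # unknown mean reward of each arm
theta = np.zeros(n_actions)               # softmax policy parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_batch(theta, batch_size=256):
    """Draw actions from the current policy and observe noisy rewards."""
    p = softmax(theta)
    actions = rng.choice(n_actions, size=batch_size, p=p)
    rewards = true_means[actions] + 0.1 * rng.normal(size=batch_size)
    return actions, rewards

for it in range(20):                       # only a few data collections
    actions, rewards = sample_batch(theta)
    baseline = rewards.mean()              # control variate
    weights = rewards - baseline           # centred, may be negative
    # Surrogate: sum_i weights_i * log pi_theta(a_i), maximized by many
    # cheap gradient steps on the same batch before new data is collected.
    for _ in range(50):
        p = softmax(theta)
        grad = np.zeros(n_actions)
        for a, w in zip(actions, weights):
            grad += w * (np.eye(n_actions)[a] - p)   # grad of log softmax
        theta += 0.01 * grad / len(actions)
    print(it, softmax(theta).round(2), round(float(rewards.mean()), 2))
```

The point of the inner loop is that the surrogate can be optimized thoroughly between data collections, so good performance is reached with few policy updates in the sense of new sampling rounds.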

Related research

01/21/2022  Under-Approximating Expected Total Rewards in POMDPs
    We consider the problem: is the optimal expected total reward to reach a...

07/05/2018  Per-decision Multi-step Temporal Difference Learning with Control Variates
    Multi-step temporal difference (TD) learning is an important approach in...

06/25/2019  Optimistic Proximal Policy Optimization
    Reinforcement Learning, a machine learning framework for training an aut...

10/24/2020  An Adiabatic Theorem for Policy Tracking with TD-learning
    We evaluate the ability of temporal difference learning to track the rew...

05/05/2019  P3O: Policy-on Policy-off Policy Optimization
    On-policy reinforcement learning (RL) algorithms have high sample comple...

10/01/2020  How Macroeconomists Lost Control of Stabilization Policy: Towards Dark Ages
    This paper is a study of the history of the transplant of mathematical t...

08/12/2021  A functional mirror ascent view of policy gradient methods with function approximation
    We use functional mirror ascent to propose a general framework (referred...
