A nearly Blackwell-optimal policy gradient method

05/28/2021
by Vektor Dewanto, et al.

For continuing environments, reinforcement learning methods commonly maximize a discounted reward criterion with a discount factor close to 1 in order to approximate the steady-state reward (the gain). However, such a criterion considers only the long-run performance and ignores the transient behaviour. In this work, we develop a policy gradient method that optimizes the gain first and then the bias, which indicates the transient performance and is needed to choose among policies with equal gain. We derive expressions that enable sampling-based estimation of the gradient of the bias and of its preconditioning Fisher matrix. We further propose an algorithm that solves the corresponding bi-level optimization using a logarithmic barrier. Experimental results provide insights into the fundamental mechanisms of our proposal.
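To make the distinction between gain and bias concrete, the following is a minimal sketch that computes both for a small Markov reward process using the standard textbook definitions (stationary distribution and fundamental matrix). It is not the sampling-based method of the paper: the two-state chain, the rewards, and the function name `gain_and_bias` are hypothetical and chosen only for illustration, and the chain is assumed to be unichain. The two transition matrices play the role of two policies with equal gain but different bias.

```python
import numpy as np

def gain_and_bias(P, r):
    """Gain and bias of a unichain Markov reward process (illustrative helper)."""
    n = P.shape[0]
    # Stationary distribution: solve pi^T P = pi^T subject to sum(pi) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Gain: steady-state expected reward (the same for every start state here).
    gain = float(pi @ r)
    # Bias: h = (I - P + P*)^{-1} (r - gain), where every row of P* equals pi.
    P_star = np.tile(pi, (n, 1))
    bias = np.linalg.solve(np.eye(n) - P + P_star, r - gain)
    return gain, bias

# Hypothetical example: state 0 is transient with reward 2,
# state 1 is absorbing with reward 1.
r = np.array([2.0, 1.0])
P_a = np.array([[0.5, 0.5], [0.0, 1.0]])  # "policy a": lingers in state 0
P_b = np.array([[0.0, 1.0], [0.0, 1.0]])  # "policy b": leaves state 0 at once

g_a, h_a = gain_and_bias(P_a, r)
g_b, h_b = gain_and_bias(P_b, r)
print("gain a:", g_a, "bias a:", h_a)
print("gain b:", g_b, "bias b:", h_b)
```

Both chains end up in state 1 forever, so their gains coincide and a gain-only criterion cannot separate them. The bias from state 0 differs because the first chain collects the higher transient reward for longer, which is the kind of tie the bias is meant to break.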

research
01/22/2022

Bag of Tricks for Natural Policy Gradient Reinforcement Learning

Natural policy gradient methods are popular reinforcement learning metho...
research
04/08/2022

Approximate discounting-free policy evaluation from transient and recurrent states

In order to distinguish policies that prescribe good from bad actions in...
research
02/11/2021

Robust Policy Gradient against Strong Data Corruption

We study the problem of robust reinforcement learning under adversarial ...
research
01/10/2013

The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

There exist a number of reinforcement learning algorithms which learn by ...
research
07/20/2019

Potential-Based Advice for Stochastic Policy Learning

This paper augments the reward received by a reinforcement learning agen...
research
09/19/2019

Revisit Policy Optimization in Matrix Form

In tabular case, when the reward and environment dynamics are known, pol...
research
03/14/2022

Optimal Aggregation Strategies for Social Learning over Graphs

Adaptive social learning is a useful tool for studying distributed decis...