An Off-policy Policy Gradient Theorem Using Emphatic Weightings

11/22/2018
by Ehsan Imani, et al.

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods, particularly OffPAC and DPG, converge to the wrong solution whereas ACE finds the optimal solution.


