Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

02/15/2022
by   Romain Laroche, et al.

In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing q, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in 𝒪(t^-1) under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
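To make the contrast concrete, here is a minimal single-state sketch of the two update directions discussed above: the standard softmax policy-gradient step, and a cross-entropy step toward the greedy action argmax_a q(a). The q-values, step size, and tabular softmax parameterization are illustrative assumptions, not the paper's experimental setup; the point is only that the policy-gradient step scales with pi(a), so it is small exactly when a good action has been driven to low probability, while the cross-entropy step does not.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def pg_update(theta, q, lr=0.1):
    # Expected softmax policy-gradient step:
    # sum_a pi(a) * grad log pi(a) * q(a) = pi * (q - pi . q).
    # Note the multiplicative pi factor: low-probability actions move slowly.
    pi = softmax(theta)
    adv = q - pi @ q
    return theta + lr * pi * adv

def ce_update(theta, q, lr=0.1):
    # Cross-entropy step toward the greedy action a* = argmax_a q(a):
    # grad of -log pi(a*) w.r.t. the logits is (onehot(a*) - pi),
    # which stays close to 1 for a* even when pi(a*) is tiny.
    pi = softmax(theta)
    onehot = np.zeros_like(theta)
    onehot[np.argmax(q)] = 1.0
    return theta + lr * (onehot - pi)

# Hypothetical critic estimates: action 0 is best, but the current
# policy strongly prefers action 1 (what was "previously learnt").
q = np.array([1.0, 0.5, -0.2])
theta_bad = np.array([-5.0, 5.0, 0.0])
```

With these numbers, the logit of the best action barely moves under `pg_update` (the step is proportional to its near-zero probability), whereas `ce_update` increases it by roughly the full learning rate, illustrating why the cross-entropy direction unlearns faster.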


Related research

- 11/22/2018: An Off-policy Policy Gradient Theorem Using Emphatic Weightings. Policy gradient methods are widely used for control in reinforcement lea...
- 09/29/2021: Dr Jekyll and Mr Hyde: the Strange Case of Off-Policy Policy Updates. The policy gradient theorem states that the policy should only be update...
- 11/05/2016: Combining policy gradient and Q-learning. Policy gradient is an efficient technique for improving a policy in a re...
- 09/18/2020: GRAC: Self-Guided and Self-Regularized Actor-Critic. Deep reinforcement learning (DRL) algorithms have successfully been demo...
- 11/04/2021: Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch. In this paper, we establish the global optimality and convergence rate o...
- 07/13/2019: Stochastic Convergence Results for Regularized Actor-Critic Methods. In this paper, we present a stochastic convergence proof, under suitable...
- 09/16/2022: Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning. Provably efficient Model-Based Reinforcement Learning (MBRL) based on op...
