Dr Jekyll and Mr Hyde: the Strange Case of Off-Policy Policy Updates

09/29/2021
by Romain Laroche, et al.

The policy gradient theorem states that the policy should only be updated in states visited by the current policy, which leads to insufficient planning in off-policy states and thus to convergence to suboptimal policies. We tackle this planning issue by extending policy gradient theory to policy updates with respect to any state density. Under these generalized policy updates, we show convergence to optimality under a necessary and sufficient condition on the updates' state densities, thereby solving the aforementioned planning issue. We also prove asymptotic convergence rates that significantly improve those in the policy gradient literature. To implement the principles prescribed by our theory, we propose an agent, Dr Jekyll and Mr Hyde (JH), with a double personality: Dr Jekyll purely exploits while Mr Hyde purely explores. JH's independent policies allow us to record two separate replay buffers, one on-policy (Dr Jekyll's) and one off-policy (Mr Hyde's), and therefore to update JH's models with a mixture of on-policy and off-policy updates. More than an algorithm, JH defines principles for actor-critic algorithms to satisfy the requirements we identify in our analysis. We extensively test JH on finite MDPs, where it demonstrates a superior ability to recover from convergence to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem, where it shows promising results.
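As a rough sketch in standard policy-gradient notation (the discounted state distribution $d^{\pi_\theta}$, the action-value function $Q^{\pi_\theta}$, and the generic density $d$ below are conventional symbols, not notation taken from the paper), the policy gradient theorem weights updates by the current policy's own state distribution,

$$\nabla_\theta J(\theta) \;\propto\; \mathbb{E}_{s \sim d^{\pi_\theta}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big],$$

whereas the generalized updates described in the abstract replace $d^{\pi_\theta}$ by an (essentially) arbitrary state density $d$,

$$\Delta\theta \;\propto\; \mathbb{E}_{s \sim d}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big],$$

for instance a density induced by mixing samples from Dr Jekyll's on-policy buffer with samples from Mr Hyde's exploratory buffer.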
