AlgaeDICE: Policy Gradient from Arbitrary Experience

12/04/2019
by   Ofir Nachum, et al.
In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation over on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. We first derive this result by considering a regularized version of the dual max-return objective before extending our findings to unregularized objectives through the use of a Lagrangian formulation of the linear programming characterization of Q-values. We show that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting. In addition to revealing the appealing theoretical properties of this approach, we also show that it delivers good practical performance.
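
For readers unfamiliar with the linear programming characterization of Q-values mentioned above, the following is a sketch of its standard textbook form and the associated Lagrangian. The notation is an assumption and is not taken from this page: \mu_0 is the initial state distribution, T the transition function, \gamma the discount factor, d^{\mathcal{D}} the off-policy data distribution, and \zeta the dual variables (scaled by d^{\mathcal{D}}). The full text may use different symbols or a different normalization.

```latex
% Primal LP for evaluating a fixed policy \pi (standard form, assumed notation):
\begin{align*}
\min_{Q}\;\; & (1-\gamma)\,
  \mathbb{E}_{s_0 \sim \mu_0,\; a_0 \sim \pi(s_0)}\bigl[ Q(s_0, a_0) \bigr] \\
\text{s.t.}\;\; & Q(s,a) \;\ge\; r(s,a)
  + \gamma\, \mathbb{E}_{s' \sim T(s,a),\; a' \sim \pi(s')}\bigl[ Q(s', a') \bigr]
  \qquad \forall (s,a).
\end{align*}

% Lagrangian with multipliers \zeta(s,a)\, d^{\mathcal{D}}(s,a) \ge 0,
% which turns the constraint term into an expectation over off-policy data:
\begin{align*}
\min_{Q}\; \max_{\zeta \ge 0}\;\;
  & (1-\gamma)\,
    \mathbb{E}_{s_0 \sim \mu_0,\; a_0 \sim \pi(s_0)}\bigl[ Q(s_0, a_0) \bigr] \\
  & + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\Bigl[ \zeta(s,a)\,
    \bigl( r(s,a)
      + \gamma\, \mathbb{E}_{s' \sim T(s,a),\; a' \sim \pi(s')}[ Q(s', a') ]
      - Q(s,a) \bigr) \Bigr].
\end{align*}
```

Under these assumptions, the optimal value of the primal is the normalized expected return of \pi, and the second expectation in the Lagrangian is taken only over samples from d^{\mathcal{D}}, which is what makes an off-policy estimate possible.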

research
06/17/2019

Is the Policy Gradient a Gradient?

The policy gradient theorem describes the gradient of the expected disco...
research
12/10/2022

Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees

We revisit the domain of off-policy policy optimization in RL from the p...
research
11/16/2019

Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

Off-policy deep reinforcement learning (RL) algorithms are incapable of ...
research
10/19/2021

Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization

Entropy regularization is an efficient technique for encouraging explora...
research
05/11/2023

Policy Gradient Algorithms Implicitly Optimize by Continuation

Direct policy optimization in reinforcement learning is usually solved w...
research
04/24/2019

Towards Combining On-Off-Policy Methods for Real-World Applications

In this paper, we point out a fundamental property of the objective in r...
research
11/02/2022

Dual Generator Offline Reinforcement Learning

In offline RL, constraining the learned policy to remain close to the da...