Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

11/16/2019
by Riashat Islam, et al.

Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data without online interactions with the environment, due to the phenomenon known as extrapolation error. This is often because past data in the replay buffer may be quite different from the data distribution under the current policy. We argue that most off-policy learning methods fundamentally suffer from a state distribution shift, caused by the mismatch between the state visitation distributions of the behavior and target policies. This shift between current and past samples can significantly degrade the performance of most modern off-policy policy optimization algorithms. In this work, we first carry out a systematic analysis of state distribution mismatch in off-policy learning, and then develop a novel off-policy policy optimization method to constrain the state distribution shift. To do this, we first estimate the state distribution from features of the state using a density estimator, and then develop a novel constrained off-policy gradient objective that minimizes the state distribution shift. Our experimental results on continuous control tasks show that minimizing this distribution mismatch can significantly improve performance in most popular practical off-policy policy gradient algorithms.
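The two-step idea in the abstract (estimate state densities, then penalize the shift between them) can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian kernel density estimator, the KL-style penalty, the bandwidth, the weight `lam`, and all function names below are assumptions chosen for illustration, and raw state vectors stand in for the learned state features the abstract mentions.

```python
import numpy as np

def gaussian_kde_logpdf(query, samples, bandwidth=0.5):
    """Log-density of each query point under a Gaussian KDE fit to samples.

    Assumption: an isotropic Gaussian kernel with a fixed bandwidth.
    """
    query = np.atleast_2d(query)
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    # Pairwise scaled squared distances, shape (num_query, n).
    diff = query[:, None, :] - samples[None, :, :]
    neg_sq = -np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2)
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2) - np.log(n)
    # Numerically stable log-sum-exp over the kernel centers.
    m = np.max(neg_sq, axis=1, keepdims=True)
    return log_norm + m[:, 0] + np.log(np.sum(np.exp(neg_sq - m), axis=1))

def state_shift_penalty(target_states, behavior_states, bandwidth=0.5):
    """Sample-based estimate of KL(d_target || d_behavior) from two state batches.

    Hypothetical penalty: the mean log-density ratio over target-policy states.
    """
    log_p = gaussian_kde_logpdf(target_states, target_states, bandwidth)
    log_q = gaussian_kde_logpdf(target_states, behavior_states, bandwidth)
    return float(np.mean(log_p - log_q))

def regularized_loss(policy_loss, target_states, behavior_states, lam=0.1):
    """Policy loss plus a weighted state-shift penalty (lam is an assumed weight)."""
    return policy_loss + lam * state_shift_penalty(target_states, behavior_states)

rng = np.random.default_rng(0)
current = rng.normal(0.0, 1.0, size=(200, 2))   # states visited by the current policy
buffer_ = rng.normal(3.0, 1.0, size=(200, 2))   # older replay-buffer states, shifted
print(state_shift_penalty(current, current))    # identical batches give zero penalty
print(state_shift_penalty(current, buffer_))    # shifted batch gives a larger penalty
```

In a real training loop the penalty would need to be differentiable with respect to the policy parameters, which this NumPy sketch does not attempt; it only shows how a density estimate over states can be turned into a scalar measure of shift.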


research
04/17/2019

Off-Policy Policy Gradient with State Distribution Correction

We study the problem of off-policy policy optimization in Markov decisio...
research
06/06/2023

State Regularized Policy Optimization on Data with Dynamics Shift

In many real-world scenarios, Reinforcement Learning (RL) algorithms are...
research
12/10/2022

Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees

We revisit the domain of off-policy policy optimization in RL from the p...
research
12/04/2019

AlgaeDICE: Policy Gradient from Arbitrary Experience

In many real-world applications of reinforcement learning (RL), interact...
research
01/20/2023

Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning

We revisit the estimation bias in policy gradients for the discounted ep...
research
06/07/2022

Generalized Data Distribution Iteration

To obtain higher sample efficiency and superior final performance simult...
