POPO: Pessimistic Offline Policy Optimization

12/26/2020
by   Qiang He, et al.
0

Offline reinforcement learning (RL), also known as batch RL, aims to optimize policy from a large pre-recorded dataset without interaction with the environment. This setting offers the promise of utilizing diverse, pre-collected datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy algorithms based on Q-learning or actor-critic perform poorly when learning from a static dataset. In this work, we study why off-policy RL methods fail to learn in offline setting from the value function view, and we propose a novel offline RL algorithm that we call Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action space, comparing or outperforming several state-of-the-art offline RL algorithms on benchmark tasks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/26/2020

Critic Regularized Regression

Offline reinforcement learning (RL), also known as batch RL, offers the ...
research
02/22/2023

Behavior Proximal Policy Optimization

Offline reinforcement learning (RL) is a challenging setting where exist...
research
06/19/2021

Boosting Offline Reinforcement Learning with Residual Generative Modeling

Offline reinforcement learning (RL) tries to learn the near-optimal poli...
research
07/01/2022

Offline Policy Optimization with Eligible Actions

Offline policy optimization could have a large impact on many real-world...
research
03/10/2021

S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning

Offline reinforcement learning proposes to learn policies from large col...
research
07/21/2020

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL

Off-policy reinforcement learning (RL) holds the promise of sample-effic...
research
10/19/2022

Robust Offline Reinforcement Learning with Gradient Penalty and Constraint Relaxation

A promising paradigm for offline reinforcement learning (RL) is to const...

Please sign up or login with your details

Forgot password? Click here to reset