Offline Policy Optimization in RL with Variance Regularization

12/29/2022
by Riashat Islam, et al.

Learning policies from fixed offline datasets is a key challenge in scaling reinforcement learning (RL) algorithms to practical applications. This is largely because off-policy RL algorithms suffer from distributional shift: the mismatch between the dataset and the target policy leads to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that, by using Fenchel duality, we can avoid the double-sampling issue in computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer yields a lower bound on the offline policy optimization objective, which can help avoid over-estimation errors and explains the benefits of our approach across a range of continuous control domains compared to existing state-of-the-art algorithms.
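The double-sampling issue the abstract refers to arises because a variance term contains a squared expectation, (E[X])², whose unbiased gradient estimate would need two independent samples of X. The Fenchel-dual identity Var(X) = min_ν E[(X − ν)²], with the minimizer ν* = E[X], lets a single sample per step drive both the dual variable and the variance estimate. The sketch below is a minimal, generic illustration of that trick on a scalar random variable, not the authors' OVAR algorithm; the function name and hyperparameters are hypothetical.

```python
import numpy as np

def variance_via_fenchel(sample_fn, n_steps=20000, burn_in=2000, lr=0.01, seed=0):
    """Estimate Var(X) = min_nu E[(X - nu)^2] with one sample per step.

    The dual variable nu tracks E[X] by SGD, so the squared-expectation
    term never needs two independent samples of X (no double sampling).
    This is an illustrative sketch, not the OVAR algorithm from the paper.
    """
    rng = np.random.default_rng(seed)
    nu = 0.0
    sq_residuals = []
    for t in range(n_steps):
        x = sample_fn(rng)                      # single sample this step
        if t >= burn_in:                        # record loss before updating nu,
            sq_residuals.append((x - nu) ** 2)  # so x and nu stay independent
        nu -= lr * 2.0 * (nu - x)               # SGD step on E[(X - nu)^2]
    return nu, float(np.mean(sq_residuals))

# Example: X ~ N(2, 1.5^2), so E[X] = 2 and Var(X) = 2.25.
nu, var_est = variance_via_fenchel(lambda rng: rng.normal(2.0, 1.5))
```

In the offline RL setting of the paper, X would be a return-like quantity weighted by stationary distribution corrections, and ν becomes a learned dual function rather than a scalar; the single-sample structure of the gradient is what carries over.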


Related research

06/14/2022 · Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning
Offline reinforcement learning (RL) extends the paradigm of classical RL...

11/15/2022 · Offline Reinforcement Learning with Adaptive Behavior Regularization
Offline reinforcement learning (RL) defines a sample-efficient learning ...

06/11/2021 · Taylor Expansion of Discount Factors
In practical reinforcement learning (RL), the discount factor used for e...

10/06/2021 · Mismatched No More: Joint Model-Policy Optimization for Model-Based RL
Many model-based reinforcement learning (RL) methods follow a similar te...

02/26/2022 · Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons
We consider reinforcement learning (RL) methods in offline domains witho...

01/05/2023 · Extreme Q-Learning: MaxEnt RL without Entropy
Modern Deep Reinforcement Learning (RL) algorithms require estimates of ...

10/12/2022 · Efficient Offline Policy Optimization with a Learned Model
MuZero Unplugged presents a promising approach for offline policy learni...
