Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning

07/03/2021
by Lionel Blondé, et al.

The performance of state-of-the-art baselines in the offline RL regime varies widely over the spectrum of dataset qualities, ranging from "far-from-optimal" random data to "close-to-optimal" expert demonstrations. We re-implement these baselines under a fair, unified, and highly factorized framework, and show that when a given baseline outperforms its competitors on one end of the spectrum, it never does on the other end. This consistent trend prevents us from naming a victor that outperforms the rest across the board. We attribute the asymmetry in performance between the two ends of the quality spectrum to the amount of inductive bias injected into the agent to entice it to posit that the behavior underlying the offline dataset is optimal for the task. The more bias is injected, the better the agent performs, provided the dataset is close-to-optimal. Otherwise, its effect is brutally detrimental. Adopting an advantage-weighted regression template as a base, we conduct an investigation which corroborates that injections of such optimality inductive bias, when not done parsimoniously, make the agent subpar, on the very datasets where it was dominant, as soon as the offline policy is sub-optimal. In an effort to design methods that perform well across the whole spectrum, we revisit the generalized policy iteration scheme for the offline regime, and study the impact of nine distinct newly-introduced proposal distributions over actions, involved in the proposed generalizations of the policy evaluation and policy improvement update rules. We show that certain orchestrations strike the right balance and can improve the performance on one end of the spectrum without harming it on the other end.
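
To make the advantage-weighted regression template mentioned above more concrete, here is a minimal sketch of its commonly used policy loss, in which the log-likelihood of each dataset action is weighted by its exponentiated advantage. This is not the authors' implementation: the function names, the `temperature` and `max_weight` parameters, and their default values are assumptions chosen for illustration.

```python
import numpy as np


def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Exponentiated-advantage weights, clipped from above for stability.

    `temperature` and `max_weight` are illustrative hyperparameters,
    not values taken from the paper.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    return np.minimum(np.exp(advantages / temperature), max_weight)


def awr_policy_loss(log_probs, advantages, temperature=1.0, max_weight=20.0):
    """Weighted negative log-likelihood of the dataset actions.

    log_probs:  log pi(a|s) of each dataset action under the current policy.
    advantages: advantage estimate of each dataset action.
    """
    weights = awr_weights(advantages, temperature, max_weight)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    return -np.mean(weights * log_probs)


if __name__ == "__main__":
    log_probs = [-0.5, -1.5, -0.7]
    advantages = [-1.0, 0.0, 2.0]
    print(awr_weights(advantages))
    print(awr_policy_loss(log_probs, advantages))
```

When every advantage is zero (or the temperature is very large), all weights equal one and the loss reduces to plain behavioral cloning, which treats every dataset action as worth imitating. One way to read the weighting, under the assumptions of this sketch, is as a dial on how strongly the agent assumes the dataset behavior is optimal.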


