Overfitting and Optimization in Offline Policy Learning

06/27/2020
by David Brandfonbrener, et al.

We consider the task of policy learning from an offline dataset generated by some behavior policy. We analyze the two most prominent families of algorithms for this task: policy optimization and Q-learning. We demonstrate that policy optimization suffers from two problems, overfitting and spurious minima, that do not appear in Q-learning or full-feedback problems (i.e., cost-sensitive classification). Specifically, we describe the phenomenon of “bandit overfitting” in which an algorithm overfits based on the actions observed in the dataset, and show that it affects policy optimization but not Q-learning. Moreover, we show that the policy optimization objective suffers from spurious minima even with linear policies, whereas the Q-learning objective is convex for linear models. We empirically verify the existence of both problems in realistic datasets with neural network models.
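
The "bandit overfitting" effect described above can be illustrated with a toy sketch. This is not the paper's experiment: the dataset sizes, the uniform behavior policy, the strictly positive rewards, and the fully tabular policy (one free distribution per logged context) are all assumptions made for illustration. Under those assumptions, maximizing the importance-weighted value estimate simply copies whichever action was logged, regardless of how good that action was, and the resulting estimate exceeds the value of any real policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 4  # hypothetical number of logged contexts and actions

# Assumed ground truth: every (context, action) reward lies in [1, 2).
true_reward = rng.uniform(1.0, 2.0, size=(n, k))

# Assumed uniform behavior policy; only the chosen action's reward is logged.
actions = rng.integers(0, k, size=n)
observed = true_reward[np.arange(n), actions]
beta = 1.0 / k  # behavior probability of each logged action

# For a tabular policy, the importance-weighted objective is linear in
# pi(. | x_i): the logged action has coefficient r_i / beta, all others 0.
weights = np.zeros((n, k))
weights[np.arange(n), actions] = observed / beta

# The maximizer puts all probability on the largest coefficient per context.
policy_actions = weights.argmax(axis=1)

# Value the objective claims for that policy on the training data.
estimated_value = weights[np.arange(n), policy_actions].mean()

# Actual value of that policy, and of the best possible policy, on the
# same contexts (computable here only because the toy rewards are known).
true_value = true_reward[np.arange(n), policy_actions].mean()
best_value = true_reward.max(axis=1).mean()

print("copies logged actions:", np.array_equal(policy_actions, actions))
print("estimated value:", estimated_value)
print("true value:", true_value)
print("best achievable value:", best_value)
```

Because every logged reward is positive, the objective rewards the policy for matching the logged action in every context, so the learned policy carries no information about which actions are actually better. A regression on the logged rewards, by contrast, is only asked to match the observed reward values, which is the intuition behind the contrast with Q-learning drawn above.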

research
07/01/2022

Offline Policy Optimization with Eligible Actions

Offline policy optimization could have a large impact on many real-world...
research
05/24/2022

Learning Stabilizing Policies in Stochastic Control Systems

In this work, we address the problem of learning provably stable neural ...
research
05/21/2020

Off-policy Learning for Remote Electrical Tilt Optimization

We address the problem of Remote Electrical Tilt (RET) optimization usin...
research
11/29/2022

Behavior Estimation from Multi-Source Data for Offline Reinforcement Learning

Offline reinforcement learning (RL) has received rising interest due to...
research
03/17/2020

Multi-action Offline Policy Learning with Bayesian Optimization

We study an offline multi-action policy learning algorithm based on doub...
research
01/30/2023

Designing an offline reinforcement learning objective from scratch

Offline reinforcement learning has developed rapidly over the recent yea...
research
09/22/2022

Proximal Point Imitation Learning

This work develops new algorithms with rigorous efficiency guarantees fo...
