Imitation-Regularized Offline Learning

01/15/2019
by Yifei Ma, et al.

We study the problem of offline learning in automated decision systems under the contextual bandits model. We are given logged historical data consisting of contexts, (randomized) actions, and (nonnegative) rewards. A common goal is to evaluate what would happen if different actions were taken in the same contexts, so as to optimize the action policies accordingly. The typical approach to this problem, inverse probability weighted estimation (IPWE) [Bottou et al., 2013], requires logged action probabilities, which may be missing in practice due to engineering complications. Even when available, small action probabilities cause large uncertainty in IPWE, rendering the corresponding results insignificant. To solve both problems, we show how one can use policy improvement (PIL) objectives, regularized by policy imitation (IML). We motivate and analyze PIL as an extension to Clipped-IPWE, by showing that both are lower-bound surrogates to the vanilla IPWE. We also formally connect IML to IPWE variance estimation [Swaminathan and Joachims, 2015] and to natural policy gradients. Without probability logging, our PIL-IML interpretations justify and improve, by reward-weighting, the state-of-the-art cross-entropy (CE) loss that predicts the action items among all action candidates available in the same contexts. With probability logging, our main theoretical contribution connects IML-underfitting to the existence of either confounding variables or model misspecification. We show the value and accuracy of our insights through simulations based on Simpson's paradox, standard UCI multiclass-to-bandit conversions, and the Criteo counterfactual analysis challenge dataset.
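To make the IPWE-versus-Clipped-IPWE comparison concrete, here is a minimal numerical sketch on synthetic logged bandit data. Everything below is illustrative: the propensities, rewards, and the clipping threshold M are invented for the example, not taken from the paper. The key property it demonstrates is the one the abstract relies on: with nonnegative rewards, clipping the importance weight at M can only shrink each term, so Clipped-IPWE is a lower-bound surrogate to vanilla IPWE, at the cost of some downward bias in exchange for much lower variance when logged propensities are small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logged bandit data. Contexts are left implicit; for each of the
# n logged rounds we keep the probability the logging policy assigned to the
# action it took (its propensity) and the observed nonnegative reward.
n = 10_000
logged_prob = rng.uniform(0.02, 1.0, size=n)       # logged propensities
reward = rng.binomial(1, 0.3, size=n).astype(float)  # nonnegative rewards

# Probability that the target (evaluation) policy would have taken the same
# action in the same context -- again synthetic, just for illustration.
target_prob = rng.uniform(0.0, 1.0, size=n)

# Importance weights: small logged_prob values inflate these, which is the
# variance problem the abstract describes.
weights = target_prob / logged_prob

# Vanilla IPWE: unbiased but high-variance when propensities are small.
ipwe = np.mean(weights * reward)

# Clipped IPWE: cap each weight at M. Since min(w, M) <= w and rewards are
# nonnegative, this estimator never exceeds vanilla IPWE (a lower-bound
# surrogate), trading a downward bias for reduced variance.
M = 10.0
clipped_ipwe = np.mean(np.minimum(weights, M) * reward)

print(f"IPWE:         {ipwe:.4f}")
print(f"Clipped-IPWE: {clipped_ipwe:.4f}")
```

The same lower-bound structure is what motivates the paper's PIL objective as a further extension; the sketch above only covers the estimation side, not the imitation-regularized learning objective.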


