Distributionally Robust Batch Contextual Bandits

06/10/2020
by Nian Si, et al.

Policy learning using historical observational data is an important problem with widespread applications. Examples include selecting offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, the existing literature rests on the crucial assumption that the future environment in which the learned policy will be deployed is the same as the past environment that generated the data, an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy from incomplete (bandit) observational data. We propose a novel learning algorithm that learns a policy robust to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure for the ambiguous environment and give a performance guarantee based on the theory of uniform convergence. Additionally, we give a heuristic algorithm to solve the distributionally robust policy learning problem efficiently.
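As a concrete illustration of evaluating a policy in an ambiguous environment (a minimal sketch, not the authors' implementation), a worst-case expected reward over a KL-divergence ball around the data-generating distribution admits a well-known one-dimensional dual: inf over distributions with KL radius delta of E[R] equals sup over alpha > 0 of -alpha * log E[exp(-R/alpha)] - alpha * delta. The function name `robust_value` and the grid search over alpha are illustrative choices, not part of the paper:

```python
import numpy as np

def robust_value(rewards, delta, alphas=None):
    """Worst-case mean reward over a KL ball of radius `delta` around the
    empirical reward distribution, computed via the dual
        sup_{alpha > 0}  -alpha * log E[exp(-R/alpha)] - alpha * delta,
    maximized here by a simple log-spaced grid search over alpha."""
    rewards = np.asarray(rewards, dtype=float)
    if alphas is None:
        alphas = np.logspace(-4, 3, 2000)
    best = -np.inf
    for a in alphas:
        z = -rewards / a
        m = z.max()
        # numerically stable log-mean-exp of -R/alpha
        lme = m + np.log(np.mean(np.exp(z - m)))
        best = max(best, -a * lme - a * delta)
    return best
```

With delta = 0 the dual recovers (approximately, up to the finite alpha grid) the ordinary empirical mean, and the robust value shrinks as the ambiguity radius delta grows, which is the behavior the robust evaluation procedure is designed to capture.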

Related research:

05/28/2023  Sample Complexity of Variance-reduced Distributionally Robust Q-learning
10/10/2018  Offline Multi-Action Policy Learning: Generalization and Optimization
07/04/2020  Linear Bandits with Limited Adaptivity and Learning Distributional Optimal Design
11/13/2019  Triply Robust Off-Policy Evaluation
05/25/2021  Robust Value Iteration for Continuous Control Tasks
10/23/2020  Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy
01/08/2019  Model-Predictive Policy Learning with Uncertainty Regularization for Driving in Dense Traffic
