Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions

10/24/2022
by Haanvid Lee, et al.

We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for a deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on an analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work goes further by handling vector action spaces and optimizing the kernel metric. We show that our estimator is consistent and, through experiments on various domains, that it significantly reduces the MSE compared to baseline OPE methods.
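To make the kernel relaxation concrete, the following is a minimal sketch of a kernel-smoothed importance sampling estimate for a deterministic target policy. It is not the paper's method: it assumes a Gaussian kernel, a single global metric matrix supplied by the caller (no local metric learning), and known behavior-policy densities. The function name `kernel_is_estimate` and all parameter names are hypothetical.

```python
import numpy as np


def kernel_is_estimate(actions, target_actions, rewards,
                       behavior_density, bandwidth, metric=None):
    """Kernel-relaxed IS estimate of a deterministic policy's value.

    actions:          (n, d) logged actions (assumed layout)
    target_actions:   (n, d) actions the target policy would take
    rewards:          (n,)   logged rewards
    behavior_density: (n,)   behavior-policy density at each logged action
    bandwidth:        scalar kernel bandwidth h > 0
    metric:           (d, d) symmetric positive-definite matrix A;
                      defaults to the identity (plain Euclidean kernel)
    """
    actions = np.asarray(actions, dtype=float)
    target_actions = np.asarray(target_actions, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    behavior_density = np.asarray(behavior_density, dtype=float)

    n, d = actions.shape
    if metric is None:
        metric = np.eye(d)

    # Squared Mahalanobis distance between logged and target actions,
    # scaled by the bandwidth.
    diff = actions - target_actions
    sq_dist = np.einsum("ni,ij,nj->n", diff, metric, diff) / bandwidth**2

    # Gaussian kernel with precision matrix A / h^2, normalized so that
    # it integrates to one over the action space.
    norm = np.sqrt(np.linalg.det(metric)) / (
        (2.0 * np.pi) ** (d / 2.0) * bandwidth**d
    )
    kernel = norm * np.exp(-0.5 * sq_dist)

    # The kernel stands in for the Dirac delta of the deterministic
    # policy in the importance weight.
    weights = kernel / behavior_density
    return float(np.mean(weights * rewards))
```

In this sketch, a smaller bandwidth reduces the bias introduced by the relaxation but increases variance, while the metric changes which action directions are treated as close; the abstract's contribution is choosing that metric to minimize MSE, which this example leaves to the caller.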

research
06/06/2020

Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

Offline reinforcement learning, wherein one uses off-policy data logged ...
research
02/13/2022

Off-Policy Evaluation for Large Action Spaces via Embeddings

Off-policy evaluation (OPE) in contextual bandits has seen rapid adoptio...
research
06/09/2019

Balanced Off-Policy Evaluation in General Action Spaces

In many practical applications of contextual bandits, online learning is...
research
06/13/2023

Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits

We consider policy optimization in contextual bandits, where one is give...
research
02/16/2018

Policy Evaluation and Optimization with Continuous Treatments

We study the problem of policy evaluation and learning from batched cont...
research
07/13/2023

Leveraging Factored Action Spaces for Off-Policy Evaluation

Off-policy evaluation (OPE) aims to estimate the benefit of following a ...
