Balanced Off-Policy Evaluation in General Action Spaces

06/09/2019
by Arjun Sondhi, et al.

In many practical applications of contextual bandits, online learning is infeasible and practitioners must rely on off-policy evaluation (OPE) of logged data collected from prior policies. OPE generally consists of a combination of two components: (i) directly estimating a model of the reward given state and action, and (ii) importance sampling. While recent work has made significant advances in adaptively combining these two components, less attention has been paid to improving the quality of the importance weights themselves. In this work we present balancing off-policy evaluation (BOP-e), an importance sampling procedure that directly optimizes for balance and can be plugged into any OPE estimator that uses importance sampling. BOP-e directly estimates the importance sampling ratio via a classifier that attempts to distinguish state-action pairs from an observed versus a proposed policy. BOP-e can be applied to continuous, mixed, and multi-valued action spaces without modification and scales easily to many observations. Further, we show that minimizing regret in the constructed binary classification problem translates directly into minimizing regret in the off-policy evaluation task. Finally, we provide experimental evidence that BOP-e outperforms inverse propensity weighting-based approaches for offline evaluation of policies in the contextual bandit setting under both discrete and continuous action spaces.
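To make the classifier-based weighting concrete, the following is a minimal sketch of the general idea the abstract describes: estimating the importance sampling ratio by training a probabilistic classifier to distinguish logged state-action pairs from pairs drawn under the proposed policy, then using the classifier's odds as importance weights inside a standard importance-sampling estimator. This is an illustration of density-ratio estimation via classification under assumed 2D array inputs, not the paper's exact BOP-e balancing objective; the function names are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_importance_weights(states, logged_actions, proposed_actions):
    # Build state-action pairs for each policy.
    X_logged = np.hstack([states, logged_actions])      # pairs from the logging policy
    X_proposed = np.hstack([states, proposed_actions])  # pairs from the proposed policy

    # Label pairs: 0 = logging policy, 1 = proposed policy.
    X = np.vstack([X_logged, X_proposed])
    y = np.concatenate([np.zeros(len(X_logged)), np.ones(len(X_proposed))])

    # Any calibrated probabilistic classifier works; logistic regression is used here for simplicity.
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # With equal-sized samples, the classifier odds p/(1-p) on a logged pair
    # estimate the density ratio pi_proposed(a|s) / pi_logged(a|s).
    p = clf.predict_proba(X_logged)[:, 1]
    return p / (1.0 - p)

def ips_value_estimate(rewards, weights):
    # Plug the estimated weights into a plain importance-sampling estimator
    # of the proposed policy's value.
    return np.mean(weights * rewards)

Because the weights are produced independently of the value estimator, they could equally be plugged into a doubly robust or self-normalized estimator in place of the plain importance-sampling average shown here.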


