Off-Policy Evaluation for Large Action Spaces via Embeddings

02/13/2022
by Yuta Saito, et al.

Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historical logged data. Unfortunately, when the number of actions is large, existing OPE estimators – most of which are based on inverse propensity score weighting – degrade severely and can suffer from extreme bias and variance. This hinders the use of OPE in many applications, from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
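
To make the contrast in the abstract concrete, the following is a minimal sketch of per-action inverse propensity score weighting next to weighting on the marginal distribution of action embeddings. It is an illustration under stated assumptions, not the authors' implementation: the policies are context-free, the embedding is discrete, the embedding distribution given each action is known, and every name (`ips_estimate`, `mips_estimate`, `p_e_given_a`, the toy data) is hypothetical.

```python
def ips_estimate(actions, rewards, pi0, pi):
    """Inverse propensity score estimate with per-action weights pi(a) / pi0(a)."""
    weights = [pi[a] / pi0[a] for a in actions]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def marginal_embedding_probs(policy, p_e_given_a):
    """p(e | policy) = sum over actions of p(e | a) * policy(a)."""
    n_embeddings = len(p_e_given_a[0])
    return [
        sum(policy[a] * p_e_given_a[a][e] for a in range(len(policy)))
        for e in range(n_embeddings)
    ]


def mips_estimate(embeddings, rewards, pi0, pi, p_e_given_a):
    """Estimate that weights each logged reward by p(e | pi) / p(e | pi0)."""
    p_e_pi0 = marginal_embedding_probs(pi0, p_e_given_a)
    p_e_pi = marginal_embedding_probs(pi, p_e_given_a)
    weights = [p_e_pi[e] / p_e_pi0[e] for e in embeddings]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


# Toy setup (assumed for illustration): 4 actions, 2 embedding categories.
# Actions 0 and 1 map to embedding 0; actions 2 and 3 map to embedding 1.
p_e_given_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pi0 = [0.4, 0.1, 0.4, 0.1]  # logging policy
pi = [0.1, 0.6, 0.1, 0.2]   # target policy to evaluate

# Hand-written log: one record per action, reward 1 only for embedding 0.
logged_actions = [0, 1, 2, 3]
logged_embeddings = [0, 0, 1, 1]
logged_rewards = [1.0, 1.0, 0.0, 0.0]

ips_value = ips_estimate(logged_actions, logged_rewards, pi0, pi)
mips_value = mips_estimate(logged_embeddings, logged_rewards, pi0, pi, p_e_given_a)
print("IPS estimate: ", ips_value)
print("MIPS estimate:", mips_value)
```

In this toy setup the per-action weights range from 0.25 to 6, while the embedding-level weights are only 1.4 and 0.6, because many actions share each embedding. That narrower weight range is the kind of variance reduction the abstract refers to. Whether the marginalized weights also avoid bias depends on conditions the paper analyzes and this sketch does not check.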

research
08/07/2023

Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces

We study Off-Policy Evaluation (OPE) in contextual bandit settings with ...
research
05/06/2023

Learning Action Embeddings for Off-Policy Evaluation

Off-policy evaluation (OPE) methods allow us to compute the expected rew...
research
06/15/2021

Control Variates for Slate Off-Policy Evaluation

We study the problem of off-policy evaluation from batched contextual ba...
research
05/14/2023

Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling

We study off-policy evaluation (OPE) of contextual bandit policies for l...
research
06/16/2020

Off-policy Bandits with Deficient Support

Learning effective contextual-bandit policies from past actions of a dep...
research
06/26/2023

Off-Policy Evaluation of Ranking Policies under Diverse User Behavior

Ranking interfaces are everywhere in online platforms. There is thus an ...
research
10/24/2022

Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions

We consider local kernel metric learning for off-policy evaluation (OPE)...
