Learning Action Embeddings for Off-Policy Evaluation

05/06/2023
by Matej Cief, et al.

Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy using logged data collected by a different policy. OPE is a viable alternative to running expensive online A/B tests: it can speed up the development of new policies and reduce the risk of exposing customers to suboptimal treatments. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have high or even infinite variance. Saito and Joachims (arXiv:2202.06317v2 [cs.LG]) propose marginalized IPS (MIPS), which uses action embeddings instead, reducing the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments it improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data. Our method makes no assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to the doubly robust (DR) estimator for combining the low variance of the direct method (DM) with the low bias of IPS.
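To make the idea concrete, below is a minimal sketch (not the authors' code) of the pipeline the abstract describes: fit a reward model whose intermediate action-embedding layer is reused as the action representation, then plug those embeddings into a MIPS-style estimator with marginalized importance weights. The synthetic data, network sizes, and the k-means discretization of the learned embeddings are all illustrative assumptions; the clustering step stands in for the paper's marginal importance-weight estimation, which may differ.

# Minimal sketch (assumptions: PyTorch + scikit-learn; discrete embeddings
# obtained by clustering the reward model's embedding table).
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, n_actions, d_ctx, d_emb, n_clusters = 5000, 100, 5, 8, 10

# --- synthetic logged bandit data (hypothetical) ---
X = rng.normal(size=(n, d_ctx)).astype(np.float32)
theta = rng.normal(size=(d_ctx, n_actions))
logits = X @ theta
pi0 = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)     # logging policy
A = np.array([rng.choice(n_actions, p=p / p.sum()) for p in pi0])
base_q = rng.normal(size=n_actions)                             # true action effects
R = (base_q[A] + 0.1 * rng.normal(size=n)).astype(np.float32)   # logged rewards
pi = np.exp(2 * logits) / np.exp(2 * logits).sum(1, keepdims=True)  # target policy

# --- reward model with an action-embedding layer we reuse ---
class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_actions, d_emb)   # intermediate output of interest
        self.head = nn.Sequential(nn.Linear(d_ctx + d_emb, 64), nn.ReLU(),
                                  nn.Linear(64, 1))
    def forward(self, x, a):
        return self.head(torch.cat([x, self.emb(a)], dim=1)).squeeze(1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
xb, ab, rb = torch.from_numpy(X), torch.from_numpy(A), torch.from_numpy(R)
for _ in range(200):                 # full-batch regression on logged rewards
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(xb, ab), rb)
    loss.backward()
    opt.step()

# --- actions -> discrete embedding clusters (simplification) ---
E = model.emb.weight.detach().numpy()
cluster = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(E)

# --- MIPS-style estimate: marginalize both policies over embedding clusters ---
def marginal(probs, cluster_ids):
    # sum action probabilities within each cluster, per context
    out = np.zeros((probs.shape[0], n_clusters))
    for c in range(n_clusters):
        out[:, c] = probs[:, cluster_ids == c].sum(1)
    return out

pi_c, pi0_c = marginal(pi, cluster), marginal(pi0, cluster)
c_logged = cluster[A]
w_mips = pi_c[np.arange(n), c_logged] / pi0_c[np.arange(n), c_logged]
w_ips = pi[np.arange(n), A] / pi0[np.arange(n), A]
print("IPS :", np.mean(w_ips * R))
print("MIPS:", np.mean(w_mips * R))

Because the marginalized weights ratio probabilities over a handful of clusters rather than over all 100 actions, their variance is typically far lower than that of the per-action IPS weights, which is the effect the abstract attributes to MIPS.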


Related research

02/13/2022 - Off-Policy Evaluation for Large Action Spaces via Embeddings
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoptio...

04/24/2023 - Moving Forward by Moving Backward: Embedding Action Impact over Action Semantics
A common assumption when training embodied agents is that the impact of ...

03/05/2022 - Off-Policy Evaluation in Embedded Spaces
Off-policy evaluation methods are important in recommendation systems an...

07/13/2023 - Leveraging Factored Action Spaces for Off-Policy Evaluation
Off-policy evaluation (OPE) aims to estimate the benefit of following a ...

05/14/2023 - Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling
We study off-policy evaluation (OPE) of contextual bandit policies for l...

06/28/2023 - DCT: Dual Channel Training of Action Embeddings for Reinforcement Learning with Large Discrete Action Spaces
The ability to learn robust policies while generalizing over large discr...

05/03/2019 - Information asymmetry in KL-regularized RL
Many real world tasks exhibit rich structure that is repeated across dif...
