Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies

11/29/2020
by   Jinlin Lai, et al.

Off-policy evaluation is a key component of reinforcement learning: it evaluates a target policy using offline data collected from behavior policies. It is a crucial step towards safe reinforcement learning and has been used in advertising, recommender systems, and many other applications. In these applications, the offline data is sometimes collected from multiple behavior policies. Previous work treats data from different behavior policies equally. However, some behavior policies yield better estimators than others. This paper begins by discussing how to correctly mix estimators produced by different behavior policies. We propose three ways to reduce the variance of the mixture estimator when all sub-estimators are unbiased or asymptotically unbiased. Furthermore, experiments on simulated recommender systems show that our methods are effective in reducing the mean squared error of estimation.
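
To make the idea of mixing sub-estimators concrete, here is a minimal sketch of one standard weighting scheme, inverse-variance weighting. It is not taken from the paper and may differ from the three methods the abstract mentions. It assumes the sub-estimators are unbiased and mutually independent and that their variances are known; the variance values and function names are hypothetical, chosen only for illustration.

```python
def inverse_variance_weights(variances):
    """Return weights proportional to 1/variance, normalized to sum to 1.

    Assumption: the sub-estimators are independent and unbiased, in which
    case these weights minimize the variance of the weighted sum.
    """
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    return [p / total for p in precisions]


def mix_estimates(estimates, weights):
    """Combine per-behavior-policy estimates into one mixture estimate."""
    return sum(w * e for w, e in zip(weights, estimates))


def mixture_variance(variances, weights):
    """Variance of the weighted sum, assuming independent sub-estimators."""
    return sum(w * w * v for w, v in zip(weights, variances))


# Hypothetical variances of three sub-estimators, one per behavior policy.
variances = [0.5, 2.0, 8.0]

weights = inverse_variance_weights(variances)
uniform = [1.0 / len(variances)] * len(variances)

# Hypothetical point estimates of the target policy's value.
estimates = [1.02, 0.90, 1.40]

print("inverse-variance weights:", weights)
print("mixture estimate:", mix_estimates(estimates, weights))
print("variance, inverse-variance weights:", mixture_variance(variances, weights))
print("variance, equal weights:", mixture_variance(variances, uniform))
```

Because the weights sum to one, the mixture stays unbiased whenever each sub-estimator is unbiased. Under the independence assumption, the inverse-variance mixture never has higher variance than the equal-weight mixture. In practice the variances are unknown and would have to be estimated from data, which this sketch does not address.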


research
08/22/2023

On the Opportunities and Challenges of Offline Reinforcement Learning for Recommender Systems

Reinforcement learning serves as a potent tool for modeling dynamic user...
research
06/06/2020

Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning

We study the efficient off-policy evaluation of natural stochastic polic...
research
06/06/2020

Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

Offline reinforcement learning, wherein one uses off-policy data logged ...
research
06/26/2023

Off-Policy Evaluation of Ranking Policies under Diverse User Behavior

Ranking interfaces are everywhere in online platforms. There is thus an ...
research
01/25/2021

High-Confidence Off-Policy (or Counterfactual) Variance Estimation

Many sequential decision-making systems leverage data collected using pr...
research
06/15/2021

Control Variates for Slate Off-Policy Evaluation

We study the problem of off-policy evaluation from batched contextual ba...
research
10/28/2021

Sayer: Using Implicit Feedback to Optimize System Policies

We observe that many system policies that make threshold decisions invol...
