Supervised Off-Policy Ranking

07/03/2021
by Yue Jin, et al.

Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy. Previous OPE methods mainly focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or more candidate policies and choose a good one, which is a much simpler task than estimating their true performance; and (2) multiple policies have usually already been deployed in real-world systems, so their true performance is known from serving real users. Motivated by these two observations, we define a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of new/target policies through supervised learning that leverages off-policy data and policies with known performance. We further propose a method for supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance rather than by estimating their precise performance. Our method leverages logged states and policies to learn a Transformer-based model that maps offline interaction data, consisting of logged states and the actions a target policy takes on those states, to a score. Experiments on different games, datasets, training policy sets, and test policy sets show that our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies. Furthermore, our method is more stable than the baseline methods.
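As a rough illustration of the quantities named in the abstract, the following minimal sketch shows one way a ranking objective and the two evaluation metrics could be computed from per-policy scores. It is not the authors' implementation: the abstract does not specify the loss, so the pairwise logistic loss here is an assumption, and the function names `pairwise_ranking_loss`, `spearman_rank_correlation`, and `top_k_gap` are hypothetical. The Transformer-based scoring model itself is omitted; the scores are taken as given.

```python
import numpy as np


def pairwise_ranking_loss(scores, true_returns):
    """Mean logistic loss over ordered policy pairs (i, j) where i truly beats j.

    Assumed loss form: the abstract only says the model is trained to rank
    policies correctly, not which ranking loss is used.
    """
    scores = np.asarray(scores, dtype=float)
    true_returns = np.asarray(true_returns, dtype=float)
    score_diff = scores[:, None] - scores[None, :]
    i_beats_j = true_returns[:, None] > true_returns[None, :]
    if not i_beats_j.any():
        return 0.0
    # log(1 + exp(-(s_i - s_j))), small when the better policy scores higher
    return float(np.mean(np.logaddexp(0.0, -score_diff[i_beats_j])))


def spearman_rank_correlation(scores, true_returns):
    """Spearman rank correlation, assuming no tied values."""
    score_ranks = np.argsort(np.argsort(scores))
    return_ranks = np.argsort(np.argsort(true_returns))
    return float(np.corrcoef(score_ranks, return_ranks)[0, 1])


def top_k_gap(scores, true_returns, k=3):
    """Gap between the truly best policy and the best of the top-k ranked ones."""
    true_returns = np.asarray(true_returns, dtype=float)
    top_k = np.argsort(scores)[::-1][:k]
    return float(true_returns.max() - true_returns[top_k].max())


# Toy values, made up purely for illustration.
true_returns = [10.0, 40.0, 30.0, 20.0]
good_scores = [0.1, 0.9, 0.5, 0.3]   # same order as the true returns
bad_scores = [0.9, 0.1, 0.5, 0.3]    # ranks the worst policy first
```

With these toy values, `good_scores` orders the policies exactly as their true returns do, so its rank correlation is 1 and its top-3 gap is 0, while `bad_scores` incurs a larger pairwise loss. A gap of 0 only requires that the truly best policy appears somewhere in the top k, which is why it is a weaker requirement than a perfect ranking.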


