Active Offline Policy Selection

by Ksenia Konyushkova, et al.

This paper addresses the problem of policy selection in domains with abundant logged data but a very restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and healthcare, among other domains. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between evaluation by OPE and full online evaluation in the real environment. To reduce this gap, we introduce a novel active offline policy selection problem formulation, which combines logged data and limited online interactions to identify the best policy. We rely on advances in OPE to warm start the evaluation. We build upon Bayesian optimization to iteratively decide which policies to evaluate, so that the limited environment interactions are used wisely. Because many candidate policies could be proposed, we focus on making our approach scalable and introduce a kernel function to model similarity between policies. We use several benchmark environments to show that the proposed approach improves upon state-of-the-art OPE estimates and upon fully online policy evaluation with a limited budget. Additionally, we show that each component of the proposed method is important, and that the method works well with varying numbers and quality of OPE estimates and even with a large number of candidate policies.
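The loop described in the abstract (warm start from OPE, a kernel over policies, and iterative selection of which policy to run next) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the Gaussian process model, the RBF kernel on squared action differences, the upper-confidence-bound selection rule, and the names `policy_kernel`, `gp_posterior`, `active_select`, and `env_return` are all hypothetical. It is a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np

def policy_kernel(actions, length_scale=1.0):
    """Assumed similarity: RBF on the mean squared difference between the
    actions two policies take on a shared set of states.
    actions has shape (n_policies, n_states, action_dim)."""
    n, n_states = actions.shape[0], actions.shape[1]
    flat = actions.reshape(n, -1)
    sq_dist = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1) / n_states
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))

def gp_posterior(prior_mean, K, idx, ys, noise_var):
    """Gaussian process posterior over all policy values, given noisy
    returns ys observed for the policies listed in idx (repeats allowed)."""
    idx = np.asarray(idx)
    ys = np.asarray(ys, dtype=float)
    K_oo = K[np.ix_(idx, idx)] + noise_var * np.eye(len(idx))
    K_ao = K[:, idx]
    mean = prior_mean + K_ao @ np.linalg.solve(K_oo, ys - prior_mean[idx])
    cov = K - K_ao @ np.linalg.solve(K_oo, K_ao.T)
    return mean, np.maximum(np.diag(cov), 0.0)

def active_select(ope, K, env_return, budget, noise_var=0.1, beta=2.0):
    """Spend `budget` episodes; return the index of the policy with the
    highest posterior mean, plus the posterior mean and variance."""
    ope = np.asarray(ope, dtype=float)
    mean, var = ope.copy(), np.diag(K).copy()  # OPE estimates act as the prior mean
    idx, ys = [], []
    for _ in range(budget):
        # Assumed acquisition rule: upper confidence bound.
        i = int(np.argmax(mean + beta * np.sqrt(var)))
        idx.append(i)
        ys.append(env_return(i))  # one noisy online episode of policy i
        mean, var = gp_posterior(ope, K, idx, ys, noise_var)
    return int(np.argmax(mean)), mean, var
```

In this sketch the kernel is what lets one online episode inform the estimates of similar policies that were never executed, which is one plausible way to read the scalability claim; the OPE estimates only set the starting point and are progressively overridden by online returns.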



Offline Policy Comparison under Limited Historical Agent-Environment Interactions

We address the challenge of policy evaluation in real-world applications...

Policy Expansion for Bridging Offline-to-Online Reinforcement Learning

Pre-training with offline data and online fine-tuning using reinforcemen...

Benchmarks for Deep Off-Policy Evaluation

Off-policy evaluation (OPE) holds the promise of being able to leverage ...

Marketing Budget Allocation with Offline Constrained Deep Reinforcement Learning

We study the budget allocation problem in online marketing campaigns tha...

Well-being policy evaluation methodology based on WE pluralism

Methodologies for evaluating and selecting policies that contribute to t...

Task Selection Policies for Multitask Learning

One of the questions that arises when designing models that learn to sol...
