Data-Efficient Policy Evaluation Through Behavior Policy Search

06/12/2017
by Josiah P. Hanna, et al.

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for the optimal behavior policy, that is, the behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates.
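
The claim that data from a different behavior policy can give unbiased estimates with lower mean squared error can be illustrated with ordinary importance sampling on a toy problem. The sketch below is not the paper's algorithm. It assumes a hypothetical one-step MDP with two actions and fixed rewards (`REWARDS`, `PI_E`, and `PI_B` are illustrative names and values), and it uses the textbook zero-variance choice for deterministic positive returns, a behavior policy proportional to `pi_e(a) * |r(a)|`. Building that policy requires knowing the rewards in advance, which is exactly the kind of term that is unknown in practice and that motivates searching for a behavior policy instead.

```python
import random

# Hypothetical one-step MDP: two actions with fixed (deterministic) rewards.
REWARDS = [1.0, 10.0]
PI_E = [0.9, 0.1]  # evaluation policy whose expected return we want


def is_moments(pi_b):
    """Exact mean and variance of the importance-sampled return under pi_b."""
    actions = range(len(REWARDS))
    # Importance-weighted return for each action: (pi_e / pi_b) * reward.
    values = [PI_E[a] / pi_b[a] * REWARDS[a] for a in actions]
    mean = sum(pi_b[a] * values[a] for a in actions)
    second_moment = sum(pi_b[a] * values[a] ** 2 for a in actions)
    return mean, second_moment - mean ** 2


def is_estimate(pi_b, n, rng):
    """Monte Carlo importance-sampling estimate from n actions drawn from pi_b."""
    total = 0.0
    for _ in range(n):
        a = rng.choices(range(len(REWARDS)), weights=pi_b)[0]
        total += PI_E[a] / pi_b[a] * REWARDS[a]
    return total / n


# True expected return of the evaluation policy.
true_value = sum(p * r for p, r in zip(PI_E, REWARDS))

# Assumed behavior policy: proportional to pi_e(a) * |r(a)|.
weights = [p * abs(r) for p, r in zip(PI_E, REWARDS)]
PI_B = [w / sum(weights) for w in weights]

on_mean, on_var = is_moments(PI_E)   # deploying the evaluation policy itself
off_mean, off_var = is_moments(PI_B)  # sampling from the behavior policy

if __name__ == "__main__":
    print("true value:", true_value)
    print("on-policy mean/variance:", on_mean, on_var)
    print("behavior-policy mean/variance:", off_mean, off_var)
    print("10-sample estimate:", is_estimate(PI_B, 10, random.Random(0)))
```

Both estimators have the same mean, so both are unbiased, but in this toy case every importance-weighted return under `PI_B` equals the true value, so the per-sample variance drops to zero (up to floating-point error). With stochastic rewards or multi-step trajectories the variance would generally not vanish.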


