Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

04/04/2016
by Philip S. Thomas, et al.

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have orders of magnitude lower mean squared error than existing methods; that is, it makes more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (Jiang and Li, 2015), and a new way to mix between model-based estimates and importance-sampling-based estimates.
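To make the abstract's first ingredient concrete, below is a minimal Python sketch of the per-trajectory doubly robust estimator of Jiang and Li (2015), the baseline the paper extends. It combines a model-based value estimate with an importance-weighted correction term. The callables pi_e, pi_b, q_hat, and v_hat are hypothetical interfaces assumed for illustration; they are not names from the paper.

```python
def doubly_robust_estimate(trajectory, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Doubly robust off-policy value estimate for one logged trajectory.

    trajectory: list of (state, action, reward) tuples logged under the
        behavior policy.
    pi_e(a, s), pi_b(a, s): action probabilities under the evaluation and
        behavior policies (assumed interfaces, for illustration only).
    q_hat(s, a), v_hat(s): action- and state-value estimates from an
        approximate model of the environment.
    """
    estimate = 0.0
    rho = 1.0       # cumulative importance weight; before step 0 it is 1
    discount = 1.0  # gamma ** t
    for (s, a, r) in trajectory:
        prev_rho = rho
        rho *= pi_e(a, s) / pi_b(a, s)
        # Model-based baseline v_hat(s), corrected by the importance-weighted
        # difference between the observed reward and the model's q_hat(s, a).
        estimate += discount * (prev_rho * v_hat(s) + rho * (r - q_hat(s, a)))
        discount *= gamma
    return estimate


# Toy usage with constant (made-up) policies and value models.
traj = [("s0", "a", 1.0), ("s1", "a", 0.0)]
print(doubly_robust_estimate(
    traj,
    pi_e=lambda a, s: 0.9,   # evaluation-policy probability (assumed)
    pi_b=lambda a, s: 0.5,   # behavior-policy probability (assumed)
    q_hat=lambda s, a: 0.5,  # approximate action-value model (assumed)
    v_hat=lambda s: 0.5,     # approximate state-value model (assumed)
))
```

Averaged over all logged trajectories, this estimate remains unbiased when the value model is wrong (provided the importance weights are correct), and it has low variance when the model is accurate. The paper's second advance, a scheme for mixing such corrections with purely model-based estimates, is not shown here.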

Related research

11/06/2021 - SOPE: Spectrum of Off-Policy Estimators
Many sequential decision making problems are high-stakes and require off...

02/20/2020 - Safe Counterfactual Reinforcement Learning
We develop a method for predicting the performance of reinforcement lear...

11/15/2019 - Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
Off-policy policy evaluation (OPE) is the problem of estimating the onli...

06/08/2019 - Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
Motivated by the many real-world applications of reinforcement learning ...

09/03/2022 - Estimating Demand for Online Delivery using Limited Historical Observations
Driven in part by the COVID-19 pandemic, the pace of online purchases fo...

06/12/2020 - Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales
This study addresses the problem of off-policy evaluation (OPE) from dep...

08/01/2018 - Off-Policy Evaluation and Learning from Logged Bandit Feedback: Error Reduction via Surrogate Policy
When learning from a batch of logged bandit feedback, the discrepancy be...
