Safe Counterfactual Reinforcement Learning

02/20/2020
by Yusuke Narita, et al.

We develop a method for predicting the performance of reinforcement learning and bandit algorithms from historical data that may have been generated by a different algorithm. Our estimator's prediction converges in probability to the true performance of the counterfactual algorithm at the fast √N rate as the sample size N increases. We also show how to correctly estimate the variance of the prediction, allowing the analyst to quantify its uncertainty. These properties hold even when the analyst does not know which among a large number of potentially important state variables are really important. These theoretical guarantees make our estimator safe to use. Finally, we apply it to improve advertisement design at a major advertising company, and find that our method produces smaller mean squared errors than state-of-the-art methods.
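To make the setting concrete, a minimal sketch of the standard inverse-propensity-weighted (IPW) off-policy value estimator, together with its estimated standard error, is shown below. This is a generic illustration of counterfactual performance prediction, not the paper's proposed estimator; the function name and the synthetic inputs are hypothetical.

```python
import numpy as np

def ipw_value_estimate(rewards, logging_probs, target_probs):
    """Estimate the value of a counterfactual (target) policy from logged data.

    rewards:       observed rewards under the logging policy
    logging_probs: probability the logging policy assigned to each logged action
    target_probs:  probability the counterfactual policy would assign to it

    Returns the IPW point estimate and its estimated standard error,
    which shrinks at the 1/sqrt(N) rate as the sample size N grows.
    """
    weights = target_probs / logging_probs      # importance weights
    terms = weights * rewards                   # per-sample IPW contributions
    v_hat = terms.mean()                        # counterfactual value estimate
    se = terms.std(ddof=1) / np.sqrt(len(terms))  # standard error of the mean
    return v_hat, se

# Synthetic usage: 4 logged rounds from a uniform logging policy,
# evaluated under a counterfactual policy with different action probabilities.
rewards = np.array([1.0, 0.0, 1.0, 1.0])
logging_probs = np.array([0.5, 0.5, 0.5, 0.5])
target_probs = np.array([0.8, 0.2, 0.2, 0.8])
v_hat, se = ipw_value_estimate(rewards, logging_probs, target_probs)
```

The standard error lets the analyst form a confidence interval around the prediction, which is the kind of uncertainty quantification the abstract refers to.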


Related research

- Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning (04/04/2016): In this paper we present a new way of predicting the performance of a re...
- Efficient Counterfactual Learning from Bandit Feedback (09/10/2018): What is the most statistically efficient way to do off-policy evaluation...
- Counterfactual Learning with General Data-generating Policies (12/04/2022): Off-policy evaluation (OPE) attempts to predict the performance of count...
- More for Less: Safe Policy Improvement With Stronger Performance Guarantees (05/13/2023): In an offline reinforcement learning setting, the safe policy improvemen...
- Counterfactual Mean-variance Optimization (09/20/2022): We study a new class of estimands in causal inference, which are the sol...
- A Large-scale Open Dataset for Bandit Algorithms (08/17/2020): We build and publicize the Open Bandit Dataset and Pipeline to facilitat...
- Degrees of Freedom Analysis of Unrolled Neural Networks (06/10/2019): Unrolled neural networks emerged recently as an effective model for lear...
