Offline Policy Comparison under Limited Historical Agent-Environment Interactions

06/07/2021
by Anton Dereventsov, et al.

We address the challenge of policy evaluation in real-world applications of reinforcement learning systems where the available historical data is limited due to ethical, practical, or security considerations. This constrained distribution of data samples often leads to biased policy evaluation estimates. To remedy this, we propose that instead of policy evaluation, one should perform policy comparison, i.e., rank the policies of interest in terms of their value based on the available historical data. In addition, we present the Limited Data Estimator (LDE), a simple method for evaluating and comparing policies from a small number of interactions with the environment. Our theoretical analysis shows that the LDE is statistically reliable on policy comparison tasks under mild assumptions on the distribution of the historical data. Our numerical experiments compare the LDE to other policy evaluation methods on the task of policy ranking and demonstrate its advantage in various settings.
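
The difference between evaluating policies and comparing them can be illustrated with a minimal sketch. It does not implement the LDE, whose definition is not given in this abstract. As a stand-in, it uses ordinary importance sampling on a context-free bandit, and it only reads off the ordering of the estimates. All names here (`is_estimate`, `rank_policies`, the toy log, and the two candidate policies) are hypothetical and chosen for illustration.

```python
def is_estimate(logged, target_policy, behavior_policy):
    """Ordinary importance-sampling estimate of a target policy's value.

    logged: list of (action, reward) pairs collected under behavior_policy.
    Policies are dicts mapping action -> probability (context-free bandit).
    NOTE: this is a stand-in estimator, not the LDE from the paper.
    """
    total = 0.0
    for action, reward in logged:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy[action] / behavior_policy[action]
        total += weight * reward
    return total / len(logged)


def rank_policies(logged, candidates, behavior_policy):
    """Return (names sorted by estimated value, best first; raw scores)."""
    scores = {
        name: is_estimate(logged, policy, behavior_policy)
        for name, policy in candidates.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical toy data: four logged interactions under a uniform policy.
behavior = {0: 0.5, 1: 0.5}
log = [(0, 1.0), (1, 0.0), (0, 1.0), (1, 0.0)]
candidates = {
    "A": {0: 0.9, 1: 0.1},  # mostly picks the action that was rewarded
    "B": {0: 0.2, 1: 0.8},  # mostly picks the unrewarded action
}

ranking, scores = rank_policies(log, candidates, behavior)
print(ranking, scores)
```

With so few samples the individual value estimates may be far from the true values, but a comparison task only requires that their ordering be correct, which is the weaker property the abstract argues for.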
