Log In Sign Up

Robust On-Policy Data Collection for Data-Efficient Policy Evaluation

by   Rujie Zhong, et al.

This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that simply running the evaluation policy – on-policy data collection – is sub-optimal for this setting. We then introduce two new data collection strategies for policy evaluation, both of which consider previously collected data when collecting future data so as to reduce distribution shift (or sampling error) in the entire dataset collected. Our empirical results show that compared to on-policy sampling, our strategies produce data with lower sampling error and generally lead to lower mean-squared error in policy evaluation for any total dataset size. We also show that these strategies can start from initial off-policy data, collect additional data, and then use both the initial and new data to produce low mean-squared error policy evaluation without using off-policy corrections.


page 1

page 2

page 3

page 4


ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling

This paper studies the problem of data collection for policy evaluation ...

Data-Efficient Policy Evaluation Through Behavior Policy Search

We consider the task of evaluating a policy for a Markov decision proces...

Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons

We consider reinforcement learning (RL) methods in offline domains witho...

OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

We consider the offline reinforcement learning (RL) setting where the ag...

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Most reinforcement learning (RL) algorithms assume online access to the ...

Off-Policy Evaluation and Learning from Logged Bandit Feedback: Error Reduction via Surrogate Policy

When learning from a batch of logged bandit feedback, the discrepancy be...

On Instrumental Variable Regression for Deep Offline Policy Evaluation

We show that the popular reinforcement learning (RL) strategy of estimat...