Value-aware Importance Weighting for Off-policy Reinforcement Learning

06/27/2023
by   Kristopher De Asis, et al.
0

Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However, importance sampling weights tend to exhibit extreme variance, often leading to stability issues in practice. In this work, we consider a broader class of importance weights to correct samples in off-policy learning. We propose the use of value-aware importance weights which take into account the sample space to provide lower variance, but still unbiased, estimates under a target distribution. We derive how such weights can be computed, and detail key properties of the resulting importance weights. We then extend several reinforcement learning prediction algorithms to the off-policy setting with these weights, and evaluate them empirically.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/18/2022

Importance Sampling Placement in Off-Policy Temporal-Difference Methods

A central challenge to applying many off-policy reinforcement learning a...
research
06/11/2019

Importance Resampling for Off-policy Prediction

Importance sampling (IS) is a common reweighting strategy for off-policy...
research
03/30/2022

Marginalized Operators for Off-policy Reinforcement Learning

In this work, we propose marginalized operators, a new class of off-poli...
research
02/14/2022

Asymptotically Unbiased Estimation for Delayed Feedback Modeling via Label Correction

Alleviating the delayed feedback problem is of crucial importance for th...
research
10/16/2019

Conditional Importance Sampling for Off-Policy Learning

The principal contribution of this paper is a conceptual framework for o...
research
10/28/2019

Minimax Weight and Q-Function Learning for Off-Policy Evaluation

We provide theoretical investigations into off-policy evaluation in rein...
research
06/07/2018

Unbiased Estimation of the Value of an Optimized Policy

Randomized trials, also known as A/B tests, are used to select between t...

Please sign up or login with your details

Forgot password? Click here to reset