Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

06/06/2020
by Nathan Kallus, et al.

Offline reinforcement learning, wherein one uses off-policy data logged by a fixed behavior policy to evaluate and learn new policies, is crucial in applications where experimentation is limited, such as medicine. We study the estimation of the policy value and policy gradient of a deterministic policy from off-policy data when actions are continuous. Targeting deterministic policies, for which the action is a deterministic function of the state, is crucial since optimal policies are always deterministic (up to ties). In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist. To circumvent this issue, we propose several new doubly robust estimators based on different kernelization approaches. We analyze the asymptotic mean-squared error of each of these under mild rate conditions for nuisance estimators. Specifically, we demonstrate how to obtain a rate that is independent of the horizon length.
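To make the kernelization idea concrete, the following is a minimal sketch for the simplest one-step (contextual bandit) case only. It is not the paper's estimator for multi-step problems. The undefined density ratio is replaced by a kernel weight K_h(a - pi(s)) / pi_b(a | s), which is combined with a reward model in the usual doubly robust form. The synthetic data, the function names (`kernel_dr_value`, `gaussian_kernel`, `normal_pdf`), the Gaussian kernel, and the bandwidth `h = 0.3` are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(u, h):
    """Scaled Gaussian kernel K_h(u) = K(u / h) / h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def normal_pdf(x, mean, std):
    """Density of a normal distribution, used as the known behavior density."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def kernel_dr_value(data, target_policy, behavior_density, q_model, h):
    """One-step kernelized doubly robust value estimate (illustrative sketch).

    data: list of (state, action, reward) tuples logged by the behavior policy.
    target_policy: deterministic map from state to action.
    behavior_density: function (action, state) -> density of the logged action.
    q_model: function (state, action) -> predicted reward (a nuisance estimate).
    h: kernel bandwidth; smaller h means less smoothing bias but more variance.
    """
    total = 0.0
    for s, a, r in data:
        a_pi = target_policy(s)
        # Kernel weight standing in for the density ratio, which does not exist
        # when the target policy is deterministic and actions are continuous.
        weight = gaussian_kernel(a - a_pi, h) / behavior_density(a, s)
        # Weighted residual correction plus the model's prediction at pi(s).
        total += weight * (r - q_model(s, a)) + q_model(s, a_pi)
    return total / len(data)


# Hypothetical synthetic problem: reward peaks when the action equals the state.
def true_reward(s, a):
    return -(a - s) ** 2


rng = random.Random(0)
data = []
for _ in range(5000):
    s = rng.uniform(-1.0, 1.0)
    a = rng.gauss(0.5 * s, 1.0)  # behavior policy: a ~ N(0.5 * s, 1)
    r = true_reward(s, a) + rng.gauss(0.0, 0.1)
    data.append((s, a, r))


def target_policy(s):
    return 0.8 * s


def behavior_density(a, s):
    return normal_pdf(a, 0.5 * s, 1.0)


# With the reward model set to the true reward, the correction term only averages noise.
est_dr = kernel_dr_value(data, target_policy, behavior_density, true_reward, h=0.3)
# With a zero reward model, the estimator reduces to kernelized importance sampling.
est_is = kernel_dr_value(data, target_policy, behavior_density, lambda s, a: 0.0, h=0.3)
print("kernel DR:", est_dr, "kernel IS:", est_is)
```

In this toy problem the true value of the target policy is -0.04 * E[s^2] = -0.04 / 3. Comparing the two printed estimates against that number gives a rough feel for how a good reward model reduces the error of the kernel-weighted term, and for how the bandwidth trades smoothing bias against variance.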

research
10/24/2022

Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions

We consider local kernel metric learning for off-policy evaluation (OPE)...
research
09/09/2019

Deterministic Value-Policy Gradients

Reinforcement learning algorithms such as the deep deterministic policy ...
research
11/29/2020

Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies

Off-policy evaluation is a key component of reinforcement learning which...
research
08/16/2016

Fast Calculation of the Knowledge Gradient for Optimization of Deterministic Engineering Simulations

A novel efficient method for computing the Knowledge-Gradient policy for...
research
06/08/2019

Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Motivated by the many real-world applications of reinforcement learning ...
research
07/31/2022

Sampling, Communication, and Prediction Co-Design for Synchronizing the Real-World Device and Digital Model in Metaverse

The metaverse has the potential to revolutionize the next generation of ...
research
06/25/2023

Inference for relative sparsity

In healthcare, there is much interest in estimating policies, or mapping...