DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

06/10/2019
by Ofir Nachum, et al.

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset rather than direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair, normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.
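To make the correction terms concrete, the sketch below computes them exactly on a tiny tabular problem. It is not the DualDICE algorithm: it assumes the transition dynamics are known, which the offline setting described above does not allow, and it assumes the dataset distribution is the discounted occupancy of a single uniform behavior policy. The dynamics `P`, rewards `R`, start distribution `MU0`, and both policies are hypothetical values chosen only for illustration.

```python
GAMMA = 0.9

# Hypothetical toy MDP with 2 states and 2 actions (illustrative values only).
# P[s][a] lists the next-state probabilities; R[s][a] is the reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [[0.0, 1.0], [0.5, 2.0]]
MU0 = [1.0, 0.0]                     # initial state distribution

TARGET = [[0.2, 0.8], [0.1, 0.9]]    # new policy pi(a|s) to evaluate
BEHAVIOR = [[0.5, 0.5], [0.5, 0.5]]  # assumed data-generating policy


def discounted_occupancy(policy, gamma=GAMMA, iters=2000):
    """Return d(s, a) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s, a_t = a).

    Uses the fixed-point iteration
        d(s') = (1 - gamma) * mu0(s') + gamma * sum_{s,a} d(s) pi(a|s) P(s'|s,a),
    which requires known dynamics (an assumption of this sketch).
    """
    n_states = len(P)
    d_state = list(MU0)
    for _ in range(iters):
        new = [(1 - gamma) * MU0[s] for s in range(n_states)]
        for s in range(n_states):
            for a, p_action in enumerate(policy[s]):
                for s_next, p_next in enumerate(P[s][a]):
                    new[s_next] += gamma * d_state[s] * p_action * p_next
        d_state = new
    return [[d_state[s] * p for p in policy[s]] for s in range(n_states)]


d_pi = discounted_occupancy(TARGET)      # where the new policy goes
d_data = discounted_occupancy(BEHAVIOR)  # where the dataset has coverage

# Correction ratio w(s, a) = d_pi(s, a) / d_data(s, a).
w = [[d_pi[s][a] / d_data[s][a] for a in range(2)] for s in range(2)]

# Normalized policy value, computed directly and by reweighting the data
# distribution with the correction ratios.
value_direct = sum(d_pi[s][a] * R[s][a] for s in range(2) for a in range(2))
value_reweighted = sum(
    d_data[s][a] * w[s][a] * R[s][a] for s in range(2) for a in range(2)
)

print("correction ratios:", w)
print("value (direct):", value_direct)
print("value (reweighted):", value_reweighted)
```

The two value computations agree because reweighting the dataset distribution by the ratios recovers the new policy's occupancy. In practice only samples from the dataset are available, so the ratios have to be estimated rather than computed from `P` as done here.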

research
10/10/2019

Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies

We consider off-policy policy evaluation when the trajectory data are ge...
research
02/21/2020

GenDICE: Generalized Offline Estimation of Stationary Values

An important problem that arises in reinforcement learning and Monte Car...
research
12/14/2022

Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction

We consider the problem of off-policy evaluation (OPE) in reinforcement ...
research
12/12/2020

Offline Policy Selection under Uncertainty

The presence of uncertainty in policy evaluation significantly complicat...
research
06/07/2021

Offline Policy Comparison under Limited Historical Agent-Environment Interactions

We address the challenge of policy evaluation in real-world applications...
research
04/05/2023

Conformal Off-Policy Evaluation in Markov Decision Processes

Reinforcement Learning aims at identifying and evaluating efficient cont...
research
07/07/2020

Off-Policy Evaluation via the Regularized Lagrangian

The recently proposed distribution correction estimation (DICE) family o...
