GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values

01/29/2020
by   Shangtong Zhang, et al.
51

We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems with GenDICE (Zhang et al., 2020), the current state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in optimization variable parameterization is introduced, so primal-dual algorithms are not guaranteed to find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE's original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence. Consequently, nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/12/2021

A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation

Marginalized importance sampling (MIS), which measures the density ratio...
research
02/21/2020

GenDICE: Generalized Offline Estimation of Stationary Values

An important problem that arises in reinforcement learning and Monte Car...
research
02/13/2022

Supported Policy Optimization for Offline Reinforcement Learning

Policy constraint methods to offline reinforcement learning (RL) typical...
research
10/09/2019

Compatible features for Monotonic Policy Improvement

Recent policy optimization approaches have achieved substantial empirica...
research
05/26/2021

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

We propose a successive convex approximation based off-policy optimizati...
research
12/14/2022

Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction

We consider the problem of off-policy evaluation (OPE) in reinforcement ...
research
02/08/2022

Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning

We propose a new policy parameterization for representing 3D rotations d...

Please sign up or login with your details

Forgot password? Click here to reset