Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration

01/31/2022
by   Chengzhuo Ni, et al.

Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on the policy gradient estimation error, which is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of the FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal via a matching Cramér-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using either softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
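To make the setting concrete, here is a minimal sketch of the general idea behind fitted off-policy PG estimation with linear value function approximation: solve a least-squares Bellman equation for the Q-function weights from offline transitions, then plug the fitted Q-values into the score-function form of the policy gradient. This is an illustrative toy (random synthetic features, hypothetical dimensions), not the paper's full double fitted iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: d-dimensional features, n offline transitions
d, n, gamma = 4, 500, 0.9
phi = rng.normal(size=(n, d))        # phi(s_i, a_i) under the behavior policy
phi_next = rng.normal(size=(n, d))   # phi(s'_i, a'_i), with a'_i drawn from the target policy
rewards = rng.normal(size=n)

# Least-squares fitted Bellman solve for linear Q(s, a) = phi(s, a) @ w:
#   w = [Phi^T (Phi - gamma * Phi')]^{-1} Phi^T r
A = phi.T @ (phi - gamma * phi_next) / n
b = phi.T @ rewards / n
w = np.linalg.solve(A, b)

# Plug-in policy gradient: average of Q(s_i, a_i) * score(s_i, a_i),
# where the scores grad_theta log pi(a_i | s_i) are also hypothetical here
scores = rng.normal(size=(n, 3))     # 3 policy parameters, for illustration
q_values = phi @ w
grad_estimate = (q_values[:, None] * scores).mean(axis=0)
```

The importance-sampling alternatives that the abstract compares against would instead reweight trajectories by likelihood ratios, which is where the exponentially large variance arises.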



Related research:

research · 09/30/2022 · Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization
We analyze the convergence rate of the unregularized natural policy grad...

research · 12/13/2015 · Policy Gradient Methods for Off-policy Control
Off-policy learning refers to the problem of learning the value function...

research · 10/27/2022 · Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions
Off-policy evaluation often refers to two related tasks: estimating the ...

research · 10/20/2019 · From Importance Sampling to Doubly Robust Policy Gradient
We show that policy gradient (PG) and its variance reduction variants ca...

research · 01/26/2023 · Partial advantage estimator for proximal policy optimization
Estimation of value in policy gradient methods is a fundamental problem....

research · 02/10/2020 · Statistically Efficient Off-Policy Policy Gradients
Policy gradient methods in reinforcement learning update policy paramete...

research · 02/20/2023 · Improving Deep Policy Gradients with Value Function Search
Deep Policy Gradient (PG) algorithms employ value networks to drive the ...
