A Unified Off-Policy Evaluation Approach for General Value Function

07/06/2021
by Tengyu Xu, et al.

General Value Functions (GVFs) are a powerful tool for representing both predictive and retrospective knowledge in reinforcement learning (RL). In practice, multiple interrelated GVFs often need to be evaluated jointly from pre-collected off-policy samples. In the literature, the gradient temporal difference (GTD) learning method has been adopted to evaluate GVFs in the off-policy setting, but such an approach can suffer from a large estimation error even when the function approximation class is sufficiently expressive. Moreover, no previous work has formally established convergence to the ground-truth GVFs under function approximation. In this paper, we address both issues through the lens of a class of GVFs with causal filtering, which covers a wide range of RL applications such as reward variance, value gradient, cost in anomaly detection, and stationary distribution gradient. We propose a new algorithm, GenTD, for off-policy GVF evaluation and show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function. We further show that, unlike GTD, the GVFs learned by GenTD are guaranteed to converge to the ground-truth GVFs as long as the function approximation power is sufficiently large. To the best of our knowledge, GenTD is the first off-policy GVF evaluation algorithm with a global optimality guarantee.
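For context on the baseline the abstract contrasts with GenTD, below is a minimal sketch of a standard GTD-style off-policy update (the TDC variant with linear features and importance-sampling ratios). The function name, step sizes, and feature setup are illustrative assumptions; this is not the paper's GenTD algorithm.

```python
import numpy as np

def gtd_update(theta, w, phi, phi_next, reward, rho, gamma=0.99,
               alpha=1e-2, beta=1e-2):
    """One TDC/GTD-style off-policy update with linear features.

    theta    -- primary weights; value estimate is V(s) ~= theta @ phi(s)
    w        -- auxiliary weights tracking the projected TD error
    phi      -- feature vector of the current state
    phi_next -- feature vector of the next state
    rho      -- importance ratio pi(a|s) / mu(a|s) (target over behavior)
    """
    # TD error under the current value estimate.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # Primary update: gradient-corrected TD step (TDC form).
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    # Auxiliary update: w regresses rho * delta onto phi (LMS step).
    w = w + beta * (rho * delta - w @ phi) * phi
    return theta, w

# Illustrative usage with random features (hypothetical data).
d = 8
theta, w = np.zeros(d), np.zeros(d)
phi, phi_next = np.random.rand(d), np.random.rand(d)
theta, w = gtd_update(theta, w, phi, phi_next, reward=1.0, rho=0.8)
```

The two-timescale structure (a fast auxiliary estimate `w` feeding a slower primary update of `theta`) is the mechanism the abstract refers to when it notes that GTD converges, yet only to a projected fixed point that can be far from the ground-truth GVFs.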


