Average-Reward Off-Policy Policy Evaluation with Function Approximation

01/08/2021
by   Shangtong Zhang, et al.

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.
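To make the setting concrete, the sketch below shows a generic off-policy average-reward evaluation loop with linear features and a GTD2-style correction term. It is a minimal illustration of the ingredients named in the abstract (differential TD error, reward-rate estimate, importance-sampling ratio, auxiliary weights), not the paper's exact algorithms; the feature dimension, step sizes, and synthetic data are all assumptions.

```python
import numpy as np

# Hedged sketch of off-policy average-reward policy evaluation with linear
# function approximation. All hyperparameters and the synthetic transitions
# are illustrative assumptions, not the paper's setup.

rng = np.random.default_rng(0)
d = 4                       # feature dimension (assumed)
w = np.zeros(d)             # differential value weights: v(s) ~ w @ x(s)
u = np.zeros(d)             # auxiliary weights (gradient-correction term)
r_bar = 0.0                 # running estimate of the reward rate
alpha, beta, eta = 0.05, 0.05, 0.01   # step sizes (assumed)

def step(x, r, x_next, rho):
    """One off-policy update from a transition (x, r, x_next).

    rho is the importance-sampling ratio pi(a|s) / b(a|s) correcting for
    the behavior policy. Uses the differential TD error, which subtracts
    the reward-rate estimate instead of discounting.
    """
    global w, u, r_bar
    delta = r - r_bar + w @ x_next - w @ x          # differential TD error
    w += alpha * rho * (delta * x - (u @ x) * x_next)  # GTD2-like update
    u += beta * rho * (delta - u @ x) * x           # track E[delta | x]
    r_bar += eta * rho * delta                      # reward-rate tracking

# Toy run on synthetic transitions (assumed data, on-policy so rho = 1).
for _ in range(1000):
    x, x_next = rng.normal(size=d), rng.normal(size=d)
    step(x, rng.normal(), x_next, rho=1.0)
```

After the loop, `w @ x(s)` approximates the differential value of a state and `r_bar` approximates the reward rate; the auxiliary vector `u` is what distinguishes a Gradient-TD-style method from semi-gradient TD, which can diverge under the deadly triad.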


Related research

06/22/2021
Variance-Aware Off-Policy Evaluation with Linear Function Approximation
We study the off-policy evaluation (OPE) problem in reinforcement learni...

06/29/2020
Learning and Planning in Average-Reward Markov Decision Processes
We introduce improved learning and planning algorithms for average-rewar...

09/30/2022
On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly-Communicating MDPs
We show two average-reward off-policy control algorithms, Differential Q...

01/05/2022
A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions
Estimating value functions is a core component of reinforcement learning...

12/01/2021
Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes
Value function approximation is a crucial module for policy evaluation i...

06/09/2019
SVRG for Policy Evaluation with Fewer Gradient Evaluations
Stochastic variance-reduced gradient (SVRG) is an optimization method or...

10/27/2022
Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions
Off-policy evaluation often refers to two related tasks: estimating the ...
