DeepAI
Log In Sign Up

A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning

04/28/2021
by   Andrew Patterson, et al.
0

Many reinforcement learning algorithms rely on value estimation. However, the most widely used algorithms – namely temporal difference algorithms – can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation which are sound under linear function approximation, based on the linear mean-squared projected Bellman error (PBE). Extending these methods to the non-linear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, called the mean-squared Bellman error (BE), which naturally facilities nonlinear approximation. In this work, we build on these insights and introduce a new generalized PBE, that extends the linear PBE to the nonlinear setting. We show how this generalized objective unifies previous work, including previous theory, and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective which is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.

READ FULL TEXT
05/17/2022

Robust Losses for Learning Value Functions

Most value function learning algorithms in reinforcement learning are ba...
07/01/2020

Gradient Temporal-Difference Learning with Regularized Corrections

It is still common to use Q-learning and temporal difference (TD) learni...
06/16/2021

Analysis and Optimisation of Bellman Residual Errors with Neural Function Approximation

Recent development of Deep Reinforcement Learning has demonstrated super...
11/10/2020

Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

Two timescale stochastic approximation (SA) has been widely used in valu...
05/24/2019

Neural Temporal-Difference Learning Converges to Global Optima

Temporal-difference learning (TD), coupled with neural networks, is amon...
06/27/2012

Statistical Linear Estimation with Penalized Estimators: an Application to Reinforcement Learning

Motivated by value function estimation in reinforcement learning, we stu...