Log In Sign Up

A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning

by   Andrew Patterson, et al.

Many reinforcement learning algorithms rely on value estimation. However, the most widely used algorithms – namely temporal difference algorithms – can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation which are sound under linear function approximation, based on the linear mean-squared projected Bellman error (PBE). Extending these methods to the non-linear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, called the mean-squared Bellman error (BE), which naturally facilities nonlinear approximation. In this work, we build on these insights and introduce a new generalized PBE, that extends the linear PBE to the nonlinear setting. We show how this generalized objective unifies previous work, including previous theory, and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective which is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.


Robust Losses for Learning Value Functions

Most value function learning algorithms in reinforcement learning are ba...

Gradient Temporal-Difference Learning with Regularized Corrections

It is still common to use Q-learning and temporal difference (TD) learni...

Analysis and Optimisation of Bellman Residual Errors with Neural Function Approximation

Recent development of Deep Reinforcement Learning has demonstrated super...

Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

Two timescale stochastic approximation (SA) has been widely used in valu...

Neural Temporal-Difference Learning Converges to Global Optima

Temporal-difference learning (TD), coupled with neural networks, is amon...

Statistical Linear Estimation with Penalized Estimators: an Application to Reinforcement Learning

Motivated by value function estimation in reinforcement learning, we stu...