Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error

01/28/2022
by Scott Fujimoto, et al.

In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.
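Both phenomena can be reproduced on a very small problem. The sketch below is not taken from the paper: it assumes a hypothetical two-state Markov reward process under a fixed policy (state 0 moves to state 1 with reward 0, state 1 loops on itself with reward 1) and a discount factor of 0.99. The names `P`, `r`, `v_true`, `bellman_error`, `v_shifted`, and `v_wrong` are illustrative only.

```python
import numpy as np

gamma = 0.99  # assumed discount factor

# Hypothetical 2-state Markov reward process under a fixed policy:
# state 0 -> state 1 with reward 0, state 1 -> state 1 with reward 1.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
r = np.array([0.0, 1.0])

# The true value function is the unique solution of V = r + gamma * P V.
v_true = np.linalg.solve(np.eye(2) - gamma * P, r)


def bellman_error(v, states=(0, 1)):
    """Absolute Bellman residual |V(s) - (r(s) + gamma * E[V(s')])| on `states`."""
    residual = v - (r + gamma * P @ v)
    return np.abs(residual[list(states)])


# (1) Cancellation: adding a constant c to every value shifts both sides of
# the Bellman equation, so the residual grows by only (1 - gamma) * c.
shift = 100.0
v_shifted = v_true + shift
value_error_shifted = np.abs(v_shifted - v_true).max()
bellman_error_shifted = bellman_error(v_shifted).max()

# (2) Finite data: suppose the dataset contains only the transition out of
# state 0. Any V with V(0) = gamma * V(1) has zero residual on that data,
# whatever V(1) is. Here V(1) is set to an arbitrary wrong number.
arbitrary_v1 = -50.0
v_wrong = np.array([gamma * arbitrary_v1, arbitrary_v1])
value_error_wrong = np.abs(v_wrong - v_true).max()
bellman_error_on_data = bellman_error(v_wrong, states=(0,)).max()

print("shifted:   value error", value_error_shifted,
      "| Bellman error", bellman_error_shifted)
print("data-only: value error", value_error_wrong,
      "| Bellman error on data", bellman_error_on_data)
```

In this toy setting, the shifted estimate is off by 100 at every state yet has a Bellman error of only about 1, and the second estimate has zero Bellman error on the observed transition while being off by roughly 150. Changing `arbitrary_v1` produces a different estimate with the same zero residual on the data, which is the sense in which the finite-data Bellman equation admits infinitely many solutions.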
