Is the Bellman residual a bad proxy?

06/24/2016
by Matthieu Geist et al.

This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, which are usually designed to maximize the mean value, and derive a method that minimizes the residual ‖T_* v_π − v_π‖_{1,ν} over policies. A theoretical analysis shows how good this proxy is for policy optimization, and notably that it is better than its value-based counterpart. We also propose experiments on randomly generated generic Markov decision processes, specifically designed to study the influence of the involved concentrability coefficient. They show that the Bellman residual is generally a bad proxy for policy optimization and that directly maximizing the mean value is much better, despite the current lack of deep theoretical analysis. This might seem obvious, as directly addressing the problem of interest is usually better, but given the prevalence of (projected) Bellman residual minimization in value-based reinforcement learning, we believe this question is worth considering.
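To make the two criteria concrete, the following minimal sketch computes, for a fixed stochastic policy on a small tabular MDP, the mean value J_ν(π) = E_{s∼ν}[v_π(s)] and the ℓ_{1,ν}-norm of the Bellman residual ‖T_* v_π − v_π‖_{1,ν}. It is only an illustration of the quantities compared in the abstract, not the authors' method; the MDP construction, function names, and the uniform choice of ν are assumptions made for the example.

```python
import numpy as np

def policy_eval(P, r, gamma, pi):
    """Exact policy evaluation on a tabular MDP.
    P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) stochastic policy."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)      # state-transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, r)        # expected immediate reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # v_pi = (I - gamma P_pi)^{-1} r_pi

def criteria(P, r, gamma, pi, nu):
    """Return the mean value J_nu(pi) and the Bellman residual ||T_* v_pi - v_pi||_{1,nu}."""
    v_pi = policy_eval(P, r, gamma, pi)
    q_pi = r + gamma * P @ v_pi                # q_pi(s, a) = r(s, a) + gamma * E[v_pi(s')]
    t_star_v = q_pi.max(axis=1)                # optimal Bellman operator applied to v_pi
    mean_value = nu @ v_pi                     # J_nu(pi) = E_{s ~ nu}[v_pi(s)]
    residual = nu @ np.abs(t_star_v - v_pi)    # ell_{1,nu} norm of the Bellman residual
    return mean_value, residual

# Toy usage on a randomly generated MDP, purely illustrative (not the paper's Garnet setup).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)        # random stochastic policy
nu = np.full(S, 1.0 / S)                      # uniform state distribution
print(criteria(P, r, gamma, pi, nu))
```

Under this setup, comparing the two printed numbers across policies illustrates the paper's question: a policy with a small residual need not have a large mean value, and vice versa.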


Related research:

- 10/21/2022: On the connection between Bregman divergence and value in regularized Markov decision processes. In this short note we derive a relationship between the Bregman divergen...
- 10/09/2019: Policy Optimization Through Approximated Importance Sampling. Recent policy optimization approaches (Schulman et al., 2015a, 2017) hav...
- 05/25/2022: An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation. Gradient descent or its variants are popular in training neural networks...
- 01/28/2022: Safe Policy Improvement Approaches on Discrete Markov Decision Processes. Safe Policy Improvement (SPI) aims at provable guarantees that a learned...
- 02/07/2020: Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts. Informed and robust decision making in the face of uncertainty is critic...
- 11/19/2010: Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. We investigate projection methods, for evaluating a linear approximation...
- 05/30/2023: Solving Robust MDPs through No-Regret Dynamics. Reinforcement Learning is a powerful framework for training agents to na...
