Temporal-difference learning for nonlinear value function approximation in the lazy training regime

05/27/2019
by Andrea Agazzi, et al.

We discuss the approximation of the value function of infinite-horizon discounted Markov Decision Processes (MDPs) by nonlinear functions trained with the Temporal-Difference (TD) learning algorithm. We consider this problem under a certain scaling of the approximating function, leading to a regime known as lazy training. In this regime, the parameters of the model vary only slightly during learning, a feature that has recently been observed in the training of neural networks, where the scaling we study arises naturally and is implicit in the initialization of the parameters. In both the under- and over-parametrized settings, we prove exponential convergence of the algorithm to local, respectively global, minimizers in the lazy training regime. We then give examples of such convergence results for models that diverge when trained with non-lazy TD learning, and for neural networks.
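
For context, here is a minimal sketch of the objects involved, in generic notation that is not necessarily that of the paper. TD(0) adjusts the parameters $\theta$ of an approximate value function $V_\theta$ along the semi-gradient of the temporal-difference error,

$$\theta_{t+1} = \theta_t + \eta\,\bigl(r_t + \gamma V_{\theta_t}(s_{t+1}) - V_{\theta_t}(s_t)\bigr)\,\nabla_\theta V_{\theta_t}(s_t),$$

where $\gamma \in (0,1)$ is the discount factor and $\eta$ the step size. One common way to formalize the lazy scaling mentioned above is to approximate the value function by a rescaled model $V_\theta = \alpha f(\,\cdot\,;\theta)$ with a large scale factor $\alpha$ (and a step size adjusted accordingly): the parameters then move only by $O(1/\alpha)$ from their initialization, so the learning dynamics stay close to those of the model linearized around $\theta_0$. The precise scaling analyzed in the paper may differ from this sketch.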
