Neural Temporal-Difference Learning Converges to Global Optima

05/24/2019
by   Qi Cai, et al.
0

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.

READ FULL TEXT
research
06/08/2020

Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory

Temporal-difference and Q-learning play a key role in deep reinforcement...
research
06/25/2019

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

Proximal policy optimization and trust region policy optimization (PPO a...
research
02/15/2020

Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling

Despite the wide applications of Adam in reinforcement learning (RL), th...
research
03/01/2019

Should All Temporal Difference Learning Use Emphasis?

Emphatic Temporal Difference (ETD) learning has recently been proposed a...
research
03/08/2020

Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate

Generative adversarial imitation learning (GAIL) demonstrates tremendous...
research
01/30/2023

A Novel Framework for Policy Mirror Descent with General Parametrization and Linear Convergence

Modern policy optimization methods in applied reinforcement learning, su...
research
07/19/2021

Reward-Weighted Regression Converges to a Global Optimum

Reward-Weighted Regression (RWR) belongs to a family of widely known ite...

Please sign up or login with your details

Forgot password? Click here to reset