Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory

06/08/2020
by Yufeng Zhang, et al.

Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks. At the core of their empirical successes is the learned feature representation, which embeds rich observations, e.g., images and texts, into the latent space that encodes semantic structures. Meanwhile, the evolution of such a feature representation is crucial to the convergence of temporal-difference and Q-learning. In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise. We aim to answer the following questions: When the function approximator is a neural network, how does the associated feature representation evolve? If it converges, does it converge to the optimal one? We prove that, utilizing an overparameterized two-layer neural network, temporal-difference and Q-learning globally minimize the mean-squared projected Bellman error at a sublinear rate. Moreover, the associated feature representation converges to the optimal one, generalizing the previous analysis of Cai et al. (2019) in the neural tangent kernel regime, where the associated feature representation stabilizes at the initial one. The key to our analysis is a mean-field perspective, which connects the evolution of a finite-dimensional parameter to its limiting counterpart over an infinite-dimensional Wasserstein space. Our analysis generalizes to soft Q-learning, which is further connected to policy gradient.
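For concreteness, the mean-squared projected Bellman error referred to above is the standard objective MSPBE(theta) = E_mu[(Q_theta(s, a) - Pi_F T Q_theta(s, a))^2], where T is the Bellman operator and Pi_F the projection onto the function class. The sketch below is a minimal, illustrative example (not the paper's construction) of TD(0) policy evaluation with an overparameterized two-layer ReLU network under mean-field (1/m) scaling, in which the parameters, and hence the learned feature representation, move away from initialization; the toy chain MDP, one-hot features, width, and step size are all assumptions chosen for readability.

```python
# A minimal sketch, assuming a toy random-walk chain MDP and one-hot features:
# TD(0) with an overparameterized two-layer ReLU network under mean-field (1/m) scaling.
import numpy as np

rng = np.random.default_rng(0)

# --- toy MDP: 5-state random walk, reward +1 on exiting to the right (assumption) ---
n_states, gamma = 5, 0.9
features = np.eye(n_states)            # one-hot state features (assumption)

def step(s):
    s_next = s + rng.choice([-1, 1])
    if s_next < 0:
        return None, 0.0               # left terminal, no reward
    if s_next >= n_states:
        return None, 1.0               # right terminal, reward +1
    return s_next, 0.0

# --- overparameterized two-layer ReLU network with mean-field scaling ---
m = 1024                               # width (overparameterization)
W = rng.normal(size=(m, n_states))     # first-layer weights
a = rng.choice([-1.0, 1.0], size=m)    # fixed second-layer signs

def value(x, W):
    # mean-field scaling: average (1/m) rather than the NTK scaling (1/sqrt(m))
    return (a * np.maximum(W @ x, 0.0)).mean()

def grad_W(x, W):
    # gradient of value(x, W) with respect to W for the ReLU activation
    active = (W @ x > 0).astype(float)
    return (a * active)[:, None] * x[None, :] / m

# --- TD(0) semi-gradient updates ---
alpha = 0.5 * m                        # step size rescaled with width (assumption)
for episode in range(2000):
    s = rng.integers(n_states)
    while s is not None:
        s_next, r = step(s)
        v = value(features[s], W)
        v_next = 0.0 if s_next is None else value(features[s_next], W)
        delta = r + gamma * v_next - v                  # TD error
        W += alpha * delta * grad_W(features[s], W)     # semi-gradient step
        s = s_next

print([round(value(features[s], W), 3) for s in range(n_states)])
```

Because the gradient of the 1/m-scaled network is itself O(1/m), the step size is rescaled with the width; this is what lets the first-layer weights, and thus the induced features, evolve rather than stay pinned near initialization as in the neural tangent kernel regime.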


Related research

12/27/2021 · Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic
Actor-critic (AC) algorithms, empowered by neural networks, have had sig...

05/24/2019 · Neural Temporal-Difference Learning Converges to Global Optima
Temporal-difference learning (TD), coupled with neural networks, is amon...

01/18/2022 · Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime
We study the global convergence of policy gradient for infinite-horizon,...

07/31/2017 · Advantages and Limitations of using Successor Features for Transfer in Reinforcement Learning
One question central to Reinforcement Learning is how to learn a feature...

10/22/2020 · Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime
We study the problem of policy optimization for infinite-horizon discoun...

09/15/2022 · Understanding Deep Neural Function Approximation in Reinforcement Learning via ε-Greedy Exploration
This paper provides a theoretical study of deep neural function approxim...

04/20/2022 · Exact Formulas for Finite-Time Estimation Errors of Decentralized Temporal Difference Learning with Linear Function Approximation
In this paper, we consider the policy evaluation problem in multi-agent ...
