Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

06/25/2019
by   Boyi Liu, et al.
0

Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/02/2020

Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy

We study the global convergence and global optimality of actor-critic, o...
research
10/26/2021

Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO

Policy optimization is a fundamental principle for designing reinforceme...
research
01/24/2022

TOPS: Transition-based VOlatility-controlled Policy Search and its Global Convergence

Risk-averse problems receive far less attention than risk-neutral contro...
research
05/24/2019

Neural Temporal-Difference Learning Converges to Global Optima

Temporal-difference learning (TD), coupled with neural networks, is amon...
research
10/29/2021

Understanding the Effect of Stochasticity in Policy Optimization

We study the effect of stochasticity in on-policy policy optimization, a...
research
03/08/2020

Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate

Generative adversarial imitation learning (GAIL) demonstrates tremendous...
research
05/29/2018

Supervised Policy Update

We propose a new sample-efficient methodology, called Supervised Policy ...

Please sign up or login with your details

Forgot password? Click here to reset