Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy

by Zuyue Fu et al.

We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once, while the actor is updated in the policy gradient direction computed using the critic. Moreover, we consider two function approximation settings, in which the actor and critic are represented by either linear functions or deep neural networks. In both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear O(K^-1/2) rate, where K is the number of iterations. To the best of our knowledge, we are the first to establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation. Moreover, under the broader scope of policy optimization with nonlinear function approximation, we are the first to prove that actor-critic with deep neural networks finds the globally optimal policy at a sublinear rate.
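The single-timescale scheme described above can be sketched on a small tabular MDP: in each iteration, the critic applies the Bellman evaluation operator once, and the actor takes a policy-gradient step using the current critic. Everything below (the random MDP, the softmax parameterization, the step sizes, and the uniform state weighting in the actor update) is an illustrative assumption, not the paper's exact setting, which uses linear or neural function approximation.

```python
# Hedged sketch of single-timescale actor-critic on a small random MDP.
# The MDP, step sizes, and softmax parameterization are illustrative
# assumptions; the paper analyzes linear and neural approximation.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition kernel P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(nS, nA))       # reward function R[s, a]

theta = np.zeros((nS, nA))  # actor parameters (softmax policy)
Q = np.zeros((nS, nA))      # critic estimate of Q^pi
alpha, beta = 0.1, 0.1      # actor/critic step sizes on the SAME timescale

def policy(theta):
    """Softmax policy pi(a|s) from tabular parameters."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for k in range(2000):
    pi = policy(theta)
    # Critic: apply the Bellman evaluation operator T^pi exactly once.
    V = (pi * Q).sum(axis=1)          # V(s') implied by the current critic
    TQ = R + gamma * P @ V            # (T^pi Q)(s, a)
    Q = Q + beta * (TQ - Q)
    # Actor: policy-gradient step computed from the current critic
    # (uniform state weighting is an assumption for this sketch).
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)
    theta = theta + alpha * pi * adv  # softmax policy-gradient direction
```

Because both updates share one loop with comparable step sizes, neither the actor nor the critic is held (near-)fixed while the other converges, which is exactly the coupling that distinguishes this setting from bi-level or two-timescale analyses.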






