On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

10/18/2019
by Harshat Kumar, et al.

Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps that estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic has been characterized, by connecting the algorithm to a limiting dynamical system. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in a controllable bias that depends on the number of critic evaluations. As a result, we are able to provide, for the first time, the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of the stochastic gradient method for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, or Accelerated Gradient Temporal Difference learning. These rates are then corroborated on a navigation problem involving an obstacle, which suggests that learning more slowly may lead to improved limit points, providing insight into the interplay between optimization and generalization in reinforcement learning.
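To make the algorithmic template concrete, the following is a minimal sketch of one possible instantiation in Python: a TD(0) critic with linear value function approximation, and an actor step whose policy gradient estimate comes from a Monte Carlo rollout, with the critic's value estimate serving as a baseline. The toy chain environment, one-hot features, geometric rollout horizon, and step sizes are illustrative assumptions chosen for exposition, not the paper's exact construction or constants.

import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.95

def step(s, a):
    """Toy chain MDP: action 0 moves left, action 1 moves right (noisy)."""
    move = 1 if a == 1 else -1
    if rng.random() < 0.1:                  # occasionally the move flips
        move = -move
    s_next = int(np.clip(s + move, 0, N_STATES - 1))
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

def features(s):
    """One-hot state features for the linear critic."""
    phi = np.zeros(N_STATES)
    phi[s] = 1.0
    return phi

def policy(theta, s):
    """Softmax policy, one logit per (state, action) pair."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta = np.zeros((N_STATES, N_ACTIONS))     # actor parameters
w = np.zeros(N_STATES)                      # critic parameters (linear V)
alpha_actor, alpha_critic = 0.05, 0.1

for k in range(3000):
    s = int(rng.integers(N_STATES))         # sample a starting state
    p = policy(theta, s)
    a = int(rng.choice(N_ACTIONS, p=p))

    # Monte Carlo rollout to estimate Q(s, a); the geometric horizon is an
    # illustrative choice that keeps the truncation bias controllable.
    horizon = int(rng.geometric(1.0 - GAMMA))
    q_hat, s_t, a_t, discount = 0.0, s, a, 1.0
    for _ in range(horizon):
        s_next, r = step(s_t, a_t)
        q_hat += discount * r
        discount *= GAMMA
        # TD(0) critic update along the rollout (one of several possible
        # policy evaluation schemes; the paper also treats GTD variants).
        td_err = r + GAMMA * w @ features(s_next) - w @ features(s_t)
        w += alpha_critic * td_err * features(s_t)
        s_t = s_next
        a_t = int(rng.choice(N_ACTIONS, p=policy(theta, s_t)))

    # Actor step: score-function policy gradient weighted by the advantage,
    # with the critic's value estimate serving as a baseline.
    adv = q_hat - w @ features(s)
    grad_log = -p.copy()
    grad_log[a] += 1.0                      # grad of log softmax at theta[s]
    theta[s] += alpha_actor * adv * grad_log

Here the rollout length is drawn from a geometric distribution so that the truncation bias of the Q-estimate shrinks in expectation; how rollout length and critic accuracy trade off against gradient bias is precisely the kind of question the paper's sample complexity analysis addresses.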


Related research

A Finite Time Analysis of Two Time-Scale Actor Critic Methods (05/04/2020)
Actor-critic (AC) methods have exhibited great empirical success compare...

Policy Gradient With Value Function Approximation For Collective Multiagent Planning (04/09/2018)
Decentralized (PO)MDPs provide an expressive framework for sequential de...

Learning to Control Partially Observed Systems with Finite Memory (02/20/2022)
We consider the reinforcement learning problem for partially observed Ma...

Multi-Preference Actor Critic (04/05/2019)
Policy gradient algorithms typically combine discounted future rewards w...

A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning (09/29/2021)
We study a novel two-time-scale stochastic gradient method for solving o...

Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies (09/10/2015)
This paper proposes GProp, a deep reinforcement learning algorithm for c...

Policy Learning and Evaluation with Randomized Quasi-Monte Carlo (02/16/2022)
Reinforcement learning constantly deals with hard integrals, for example...
