Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

by Tengyu Xu et al.

Two-timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two-timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. Previous studies provided non-asymptotic analyses of linear TDC and Greedy-GQ in the Markovian setting, but only with diminishing or accuracy-dependent stepsizes; for the nonlinear TDC algorithm, only asymptotic convergence had been established. In this paper, we study the non-asymptotic convergence rate of two-timescale linear and nonlinear TDC and Greedy-GQ under Markovian sampling and with an accuracy-independent constant stepsize. For linear TDC, we provide a novel non-asymptotic analysis and show that it attains an ϵ-accurate solution with the optimal sample complexity of 𝒪(ϵ^-1log(1/ϵ)) under a constant stepsize. For nonlinear TDC and Greedy-GQ, we show that both algorithms attain an ϵ-accurate stationary solution with sample complexity 𝒪(ϵ^-2). This is the first non-asymptotic convergence result established for nonlinear TDC under Markovian sampling, and our result for Greedy-GQ improves on the previous result by an order-wise factor of 𝒪(ϵ^-1log(1/ϵ)).
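To make the two-timescale structure concrete, the following is a minimal sketch of one linear TDC update with constant stepsizes, as analyzed in the paper. The function name `tdc_step` and the specific stepsize values are illustrative assumptions, not taken from the paper; the update equations follow the standard TD-with-gradient-correction form, where the auxiliary weights `w` are updated on the fast timescale (`beta`) and the main parameters `theta` on the slow timescale (`alpha`).

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, gamma=0.99, alpha=0.01, beta=0.05):
    """One two-timescale linear TDC update with constant stepsizes.

    theta    : main parameter vector (slow timescale, stepsize alpha)
    w        : auxiliary weight vector (fast timescale, stepsize beta)
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    """
    # TD error under the current linear value estimate
    delta = reward + gamma * phi_next @ theta - phi @ theta
    # Slow update: semi-gradient term plus the gradient-correction term
    theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
    # Fast update: w tracks the least-squares solution for the TD error
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
```

The constant, accuracy-independent stepsizes (rather than diminishing ones) are exactly the regime in which the paper establishes its 𝒪(ϵ^-1 log(1/ϵ)) sample complexity for linear TDC.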


