Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

09/26/2019
by   Tengyu Xu, et al.
0

Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under identical and independently distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(log t/(t^(2/3))) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/10/2020

Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

Two timescale stochastic approximation (SA) has been widely used in valu...
research
08/18/2023

Baird Counterexample Is Solved: with an example of How to Debug a Two-time-scale Algorithm

Baird counterexample was proposed by Leemon Baird in 1995, first used to...
research
05/12/2020

On the rate of convergence of the Gaver-Stehfest algorithm

The Gaver-Stehfest algorithm is widely used for numerical inversion of L...
research
10/26/2020

Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis

Variance reduction techniques have been successfully applied to temporal...
research
04/16/2021

Distributed TD(0) with Almost No Communication

We provide a new non-asymptotic analysis of distributed TD(0) with linea...
research
08/03/2016

Fast and Simple Optimization for Poisson Likelihood Models

Poisson likelihood models have been prevalently used in imaging, social ...
research
02/14/2022

On the Chattering of SARSA with Linear Function Approximation

SARSA, a classical on-policy control algorithm for reinforcement learnin...

Please sign up or login with your details

Forgot password? Click here to reset