A new Gradient TD Algorithm with only One Step-size: Convergence Rate Analysis using L-λ Smoothness

07/29/2023
by Hengshuai Yao, et al.

Gradient Temporal Difference (GTD) algorithms (Sutton et al., 2008, 2009) are the first O(d) algorithms (where d is the number of features) with convergence guarantees for off-policy learning with linear function approximation. Liu et al. (2015) and Dalal et al. (2018) proved that the convergence rates of GTD, GTD2, and TDC are O(t^{-α/2}) for some α ∈ (0,1). This bound is tight (Dalal et al., 2020) and slower than O(1/√t). GTD algorithms also have two step-size parameters, which are difficult to tune. The literature contains a "single-time-scale" formulation of GTD, but it still has two step-size parameters. This paper presents a truly single-time-scale GTD algorithm for minimizing the Norm of the Expected TD Update (NEU) objective, and it has only one step-size parameter. We prove that the new algorithm, called Impression GTD, converges at least as fast as O(1/t). Furthermore, based on a generalization of expected smoothness (Gower et al., 2019), called L-λ smoothness, we prove that the new GTD converges even faster, in fact at a linear rate. Our rate also improves on Gower et al.'s result, giving a tighter bound under a weaker assumption. Besides Impression GTD, we prove the rates of three other GTD algorithms: one by Yao and Liu (2008), another called A-transpose-TD (Sutton et al., 2008), and a counterpart of A-transpose-TD. The convergence rates of all four GTD algorithms are proved within a single generic GTD framework to which L-λ smoothness applies. Empirical results on random walks, the Boyan chain, and Baird's counterexample show that Impression GTD converges much faster than existing GTD algorithms on both on-policy and off-policy learning problems, with step-sizes that perform well over a wide range.
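For a concrete picture of what a single-step-size, NEU-minimizing GTD update can look like, here is a minimal NumPy sketch. It is only an illustration under assumed choices (linear features, an importance-sampling ratio rho, and a running-average estimate u of the expected TD update, so that no second step-size appears); the function name `neu_gtd_sketch` and its arguments are hypothetical, and this is not the paper's Impression GTD algorithm, whose exact update rule is given in the full text.

```python
import numpy as np

def neu_gtd_sketch(transitions, d, alpha=0.1, gamma=0.99):
    """Illustrative single-step-size gradient descent on the NEU objective
    ||E[rho * delta * phi]||^2 with linear features.

    transitions: iterable of (phi, reward, phi_next, rho), all hypothetical names.
    """
    theta = np.zeros(d)   # weights of the linear value estimate
    u = np.zeros(d)       # running average estimating E[rho * delta * phi]
    for t, (phi, r, phi_next, rho) in enumerate(transitions, start=1):
        delta = r + gamma * theta @ phi_next - theta @ phi   # TD error
        u += (rho * delta * phi - u) / t                     # sample average, no extra step-size
        # Gradient of the NEU objective is -2 A^T E[rho*delta*phi] with
        # A = E[rho * phi (phi - gamma*phi_next)^T]; descend along a sampled A^T u.
        theta += alpha * rho * (phi - gamma * phi_next) * (phi @ u)
    return theta
```

The running average for u avoids the double-sampling problem in estimating the NEU gradient without introducing a second tunable step-size, which is the property the abstract attributes to the new algorithm; the actual Impression GTD update may differ.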
