Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis

10/26/2020
by Shaocong Ma, et al.

Variance reduction techniques have been successfully applied to temporal-difference (TD) learning and help to improve the sample complexity of policy evaluation. However, existing work applies variance reduction either to the less popular one-time-scale TD algorithm or to the two-time-scale GTD algorithm with only a finite number of i.i.d. samples, and both algorithms are restricted to the on-policy setting. In this work, we develop a variance reduction scheme for the two-time-scale TDC algorithm in the off-policy setting and analyze its non-asymptotic convergence rate over both i.i.d. and Markovian samples. In the i.i.d. setting, our algorithm achieves a sample complexity O(ϵ^{-3/5} log ϵ^{-1}), which improves on the state-of-the-art result O(ϵ^{-1} log ϵ^{-1}). In the Markovian setting, our algorithm matches the state-of-the-art sample complexity O(ϵ^{-1} log ϵ^{-1}), which is near-optimal. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller asymptotic convergence error than both the conventional TDC and the variance-reduced TD.
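The abstract does not spell out the update rule, but the underlying off-policy TDC iteration combined with an SVRG-style control variate can be sketched as follows. This is a minimal illustrative sketch under linear function approximation; the function names, epoch structure, and uniform resampling from a fixed batch are assumptions for illustration, not the authors' exact variance-reduced TDC procedure.

import numpy as np

def tdc_directions(theta, w, phi, phi_next, reward, rho, gamma):
    """TDC update directions for one off-policy transition under linear
    function approximation (illustrative sketch).

    theta : main parameter vector (updated with the slower step size)
    w     : auxiliary correction vector (updated with the faster step size)
    rho   : importance sampling ratio pi(a|s) / b(a|s)
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta   # TD error
    d_theta = rho * (delta * phi - gamma * (phi @ w) * phi_next)
    d_w = rho * (delta - phi @ w) * phi
    return d_theta, d_w

def vr_tdc_epoch(theta_ref, w_ref, batch, alpha, beta, gamma, rng):
    """One SVRG-style epoch anchored at the snapshot (theta_ref, w_ref).

    batch : list of (phi, phi_next, reward, rho) transitions, used both for
            the batch-mean reference directions and for the inner updates.
    """
    # Batch-averaged update directions at the snapshot.
    ref_theta = np.zeros_like(theta_ref)
    ref_w = np.zeros_like(w_ref)
    for phi, phi_next, r, rho in batch:
        dt, dw = tdc_directions(theta_ref, w_ref, phi, phi_next, r, rho, gamma)
        ref_theta += dt / len(batch)
        ref_w += dw / len(batch)

    theta, w = theta_ref.copy(), w_ref.copy()
    for _ in range(len(batch)):
        phi, phi_next, r, rho = batch[rng.integers(len(batch))]
        dt, dw = tdc_directions(theta, w, phi, phi_next, r, rho, gamma)
        dt0, dw0 = tdc_directions(theta_ref, w_ref, phi, phi_next, r, rho, gamma)
        # SVRG-style control variate: stochastic direction minus its value at
        # the snapshot, plus the batch mean.
        theta = theta + alpha * (dt - dt0 + ref_theta)
        w = w + beta * (dw - dw0 + ref_w)
    return theta, w

The two step sizes alpha (for theta) and beta (for w) reflect the two time scales of TDC, and the snapshot would be refreshed between epochs.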
