
Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis

by Shaocong Ma, et al.

Variance reduction techniques have been successfully applied to temporal-difference (TD) learning and help to improve the sample complexity of policy evaluation. However, existing work has applied variance reduction either to the less popular one-time-scale TD algorithm or to the two-time-scale GTD algorithm, the latter only with a finite number of i.i.d. samples, and both algorithms apply only to the on-policy setting. In this work, we develop a variance reduction scheme for the two-time-scale TDC algorithm in the off-policy setting and analyze its non-asymptotic convergence rate over both i.i.d. and Markovian samples. In the i.i.d. setting, our algorithm achieves a sample complexity O(ϵ^{-3/5} log ϵ^{-1}) that is lower than the state-of-the-art result O(ϵ^{-1} log ϵ^{-1}). In the Markovian setting, our algorithm achieves the state-of-the-art sample complexity O(ϵ^{-1} log ϵ^{-1}), which is near-optimal. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller asymptotic convergence error than both the conventional TDC and the variance-reduced TD.
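To make the setting concrete, the sketch below shows an SVRG-style variance reduction wrapped around the standard two-time-scale TDC updates with linear function approximation and importance-sampling ratios for the off-policy correction. This is an illustrative construction in the spirit of the abstract, not the authors' exact algorithm; the synthetic environment, step sizes, epoch lengths, and batch sizes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic off-policy policy-evaluation problem (all values are assumptions).
n_states, d = 10, 4
phi = rng.normal(size=(n_states, d))        # feature matrix, one row per state
gamma = 0.9

# A fixed batch of transitions (s, r, s') with importance-sampling ratios rho
# between the target and behavior policies (synthetic, near 1).
N = 500
s = rng.integers(0, n_states, size=N)
s_next = rng.integers(0, n_states, size=N)
r = rng.normal(size=N)
rho = np.exp(rng.normal(scale=0.1, size=N))

def tdc_grads(theta, w, i):
    """Per-sample TDC pseudo-gradients for transition i (standard TDC form)."""
    x, x_next = phi[s[i]], phi[s_next[i]]
    delta = r[i] + gamma * x_next @ theta - x @ theta     # TD error
    g_theta = rho[i] * (delta * x - gamma * (x @ w) * x_next)
    g_w = rho[i] * (delta - x @ w) * x
    return g_theta, g_w

alpha, beta = 0.02, 0.05                    # two time-scale step sizes (beta > alpha)
theta, w = np.zeros(d), np.zeros(d)

for epoch in range(20):
    # Reference point and full-batch pseudo-gradients (SVRG-style snapshot).
    theta_ref, w_ref = theta.copy(), w.copy()
    G_theta, G_w = np.zeros(d), np.zeros(d)
    for i in range(N):
        gt, gw = tdc_grads(theta_ref, w_ref, i)
        G_theta += gt
        G_w += gw
    G_theta /= N
    G_w /= N
    # Inner loop: stochastic update minus its value at the reference point,
    # plus the full-batch average -- the variance-reduced estimator.
    for _ in range(N):
        i = rng.integers(N)
        gt, gw = tdc_grads(theta, w, i)
        gt_ref, gw_ref = tdc_grads(theta_ref, w_ref, i)
        theta = theta + alpha * (gt - gt_ref + G_theta)
        w = w + beta * (gw - gw_ref + G_w)

print("theta:", theta)
print("w:", w)
```

The key design point is that the variance-reduced correction `g - g_ref + G` is applied independently on each of the two time scales, so both the main iterate `theta` and the auxiliary iterate `w` see low-variance updates while keeping their step-size separation.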



