Log In Sign Up

Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity

by   Shaocong Ma, et al.

Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, the finite-time analysis of Greedy-GQ has been developed under linear function approximation and Markovian sampling, and the algorithm is shown to achieve an Ο΅-stationary point with a sample complexity in the order of π’ͺ(Ο΅^-3). Such a high sample complexity is due to the large variance induced by the Markovian samples. In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves a much smaller bias and variance error than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity that is in the order of π’ͺ(Ο΅^-2). We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings.


page 1

page 2

page 3

page 4

βˆ™ 10/26/2020

Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis

Variance reduction techniques have been successfully applied to temporal...
βˆ™ 05/20/2020

Finite-sample Analysis of Greedy-GQ with Linear Function Approximation under Markovian Noise

Greedy-GQ is an off-policy two timescale algorithm for optimal control i...
βˆ™ 11/21/2020

On the Convergence of Reinforcement Learning

We consider the problem of Reinforcement Learning for nonlinear stochast...
βˆ™ 10/30/2022

Robust Data Valuation via Variance Reduced Data Shapley

Data valuation, especially quantifying data value in algorithmic predict...
βˆ™ 11/10/2020

Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

Two timescale stochastic approximation (SA) has been widely used in valu...
βˆ™ 06/22/2021

Local policy search with Bayesian optimization

Reinforcement learning (RL) aims to find an optimal policy by interactio...
βˆ™ 12/22/2017

Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator

Reinforcement learning (RL) has been successfully used to solve many con...