# On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms, which we call GTD and which minimize the mean squared projected Bellman error using stochastic gradient-descent; (ii) their "robustified" biased variants; (iii) their mirror-descent versions which combine the mirror-descent idea with TD learning; and (iv) a single-time-scale version of GTD that solves minimax problems formulated for approximate policy evaluation. We derive convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard type of diminishing stepsize with a square-summable condition. For the first two types of stepsizes, we apply the weak convergence method from stochastic approximation theory to characterize the asymptotic behavior of the algorithms, and for the standard type of stepsize, we analyze the algorithmic behavior with respect to a stronger mode of convergence, almost sure convergence. Our convergence results are for the aforementioned TD algorithms with three general ways of setting their λ-parameters: (i) state-dependent λ; (ii) a recently proposed scheme of using history-dependent λ to keep the eligibility traces of the algorithms bounded while allowing for relatively large values of λ; and (iii) a composite scheme of setting the λ-parameters that combines the preceding two schemes and allows a broader class of generalized Bellman operators to be used for approximate policy evaluation with TD methods.


## 1 Introduction

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes (MDPs) with finite spaces and discounted reward criteria. Off-policy TD learning extends on-policy model-free TD learning [27, 34] (see also the books [3, 30]) to cases where stationary policies of interest are evaluated using data collected without executing the policies. It is more flexible than on-policy learning and can be useful not only as a computational tool to solve MDPs, but also as an aid to building experience-based knowledge representations for autonomous agents in AI applications [29]. The specific class of algorithms that we consider in this technical report is the class of gradient-based off-policy TD algorithms with linear function approximation. Our purpose is to analyze several such algorithms proposed in the literature, and present a collection of convergence results for a broad range of choices of stepsizes and other important algorithmic parameters.

The algorithms that we will analyze include the following:

• Two two-time-scale gradient-based TD algorithms proposed and studied by Sutton et al. [31, 33] and Maei [10]. These two algorithms use stochastic gradient-descent to minimize the mean squared projected Bellman error, a convex quadratic objective function, for approximate policy evaluation, thereby overcoming the divergence issue in off-policy TD learning. They have been called GTD2, TDC, as well as GTD(λ) in the early works just mentioned. Here we shall refer to them as GTDa and GTDb, respectively, and refer to both algorithms as GTD algorithms.

• A single-time-scale version of GTDa that solves minimax problems formulated for approximate policy evaluation. This algorithm was also considered in the early works on GTD just mentioned. However, the fact that it solves a minimax problem equivalent to the projected-Bellman-error minimization was pointed out only later by Liu et al. [9] (see also Mahadevan et al. [13]). The latter viewpoint facilitates convergence analysis of the algorithm by placing it in a more general class of stochastic approximation algorithms for solving minimax problems.

• The mirror-descent versions of GTD and TD. Combining the mirror-descent idea of Nemirovsky and Yudin [18] with TD learning was proposed by Mahadevan and Liu [12] (see also [13]).

• “Robustified” biased variants of the preceding algorithms. These algorithms use a “robustification” procedure to mitigate the high-variance issue in off-policy learning, at the price of introducing biases. They are similar to the biased variant algorithms considered by the author [37] for the emphatic TD (ETD) algorithm proposed by Sutton et al. [32]. In the present context, as we will show, for the two-time-scale GTD algorithms, the variant algorithms can be viewed as approximate gradient algorithms, and for the single-time-scale GTDa, its variant tries to solve minimax problems that approximate the ones GTDa tries to solve.

We will analyze primarily constrained algorithms, which confine their iterates to bounded sets. Only for the single-time-scale GTDa algorithm will we also analyze the unconstrained version, under certain conditions.

We will present convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard type of diminishing stepsize with a square-summable condition. For the first two types of stepsizes, we apply the weak convergence method from stochastic approximation theory [8] to characterize the asymptotic behavior of the algorithms. For the third, standard type of stepsize, we analyze the algorithmic behavior with respect to a stronger mode of convergence, almost sure convergence, by using general results on stochastic approximation [4, 8].

Our convergence results are for the aforementioned TD algorithms with three general ways of setting the λ-parameters in TD learning:

• state-dependent λ [28, 30];

• a case of history-dependent λ as proposed recently by Yu et al. [40], which can keep the eligibility traces in the off-policy TD algorithms bounded while allowing for relatively large values of λ;

• a composite scheme of setting the λ-parameters [35, 40], which combines the preceding two schemes and allows a broader class of generalized Bellman operators to be used for approximate policy evaluation with TD methods.

To our knowledge, for off-policy gradient-based TD algorithms with linear function approximation, there are few prior convergence results that address the case of general nonzero λ-parameters. Although such algorithms with constant or state-dependent λ have been proposed and investigated (see e.g., [10, 11, 13]), the analyses given earlier [9, 10, 13, 31, 33] have only proved convergence for the case where λ = 0 and the data consist of i.i.d. state transitions. But the assumption of i.i.d. data is unrealistic for reinforcement learning even in the case λ = 0. Moreover, these analyses cannot be extended to the case of positive λ, where the algorithms need to use non-i.i.d. off-policy data in order to gather information about the multistep Bellman operator with respect to which the mean squared projected Bellman error is defined. To our knowledge, the only prior convergence result that applies to off-policy data is given by Karmakar and Bhatnagar [6]. It is a convergence result for the two-time-scale TDC algorithm (GTDb as we call it) with the standard type of diminishing stepsize, obtained as an application of the theoretical results developed in [6] for two-time-scale differential inclusions. The result is for λ = 0, although its arguments can be applied in the case where λ is small enough that the eligibility traces produced in the algorithm are bounded.

Our results on gradient-based TD algorithms differ from that of [6] not only in the range of algorithms and parameter settings they cover, but also in the proof approaches by which they are derived. Specifically, we combine the ordinary-differential-equation (ODE) based proof methods in stochastic approximation theory [8] with special properties of the eligibility traces and the ergodicity of the joint state and eligibility trace process under the various settings of the λ-parameters mentioned above. Those properties were derived in the author's earlier works [35, 37] for state-dependent λ and in the recent work [40] for the special case of history-dependent λ mentioned above. The properties of the joint state and eligibility trace process are not considered in [6], since it treats the case where the eligibility traces are simply functions of states, but these properties are important for convergence analysis of those TD algorithms that use general nonzero λ-parameters. The ODE-based line of analysis we use is less general than the differential-inclusion-based method studied in [6], however. As future work, it can be worthwhile to use the latter approach to handle even more flexible ways of choosing history-dependent λ than the one we consider in this work.

Another difference between our work and [6] is that we analyze primarily constrained algorithms, as mentioned earlier. The result given in [6] is for the unconstrained two-time-scale TDC under the assumption that the iterates are almost surely bounded. With constraints, we do not need such assumptions; instead, we simply require that the constraint sets be large enough so that the algorithms can estimate the gradients correctly. The presence of constraint sets helps us avoid some theoretical difficulties in convergence analysis. However, extra work is also needed to ensure that the constraint sets do not prevent the algorithms from achieving the goals they are designed for. We take care of such issues in our analyses, especially for the mirror-descent GTD/TD algorithms and for the single- and two-time-scale GTDa algorithms, which are not as straightforward as the two-time-scale GTDb algorithm.

This technical report is organized as follows. Section 2 covers the preliminaries: We first describe the off-policy policy evaluation problem and the two two-time-scale GTD algorithms. We then explain the role of the λ-parameters in TD learning, and discuss the properties of the eligibility traces and of the joint state and eligibility trace process, in order to set the stage for the convergence analyses. In Sections 3-4, we present convergence results for slowly diminishing stepsize and constant stepsize, which are derived with the weak convergence method. Section 3 is for the two two-time-scale GTD algorithms and their biased variants. Section 4 is for the mirror-descent GTD/TD algorithms, as well as a single-time-scale GTDa algorithm and its biased variant, both of which solve minimax problems for policy evaluation. In this section we also use the minimax problem formulation to strengthen the results of Section 3 for the two-time-scale GTDa algorithm and its biased variant. In Section 5, we consider standard stepsize conditions and present almost sure convergence results for both the two-time-scale and single-time-scale algorithms, including a result for the unconstrained single-time-scale GTDa for certain choices of the λ-parameters. We then conclude in Section 6 with a brief discussion of these results and open questions. For quick access to the convergence results in Sections 3-5, the convergence theorems are listed at the beginning of each of those sections.

## 2 Preliminaries

In this section we first introduce the off-policy policy evaluation problem, and describe two basic forms of the gradient-based TD algorithms, which were proposed and studied in [10, 11, 31, 33]. We then explain generalized Bellman operators and how they relate to the λ-parameters of the TD algorithms and to the objectives of these algorithms. We also specify two ways of choosing these λ-parameters for the algorithms that we will analyze in the paper. These background materials are given in Section 2.1. In Section 2.2, the second half of this section, we present materials to set the stage for analyzing the gradient-based TD algorithms as stochastic approximation algorithms in the rest of the paper. These materials concern the state-trace process—the random process that underlies and drives the TD algorithms—and its properties that are important for convergence analysis (including, among others, ergodicity and uniform integrability properties).

### 2.1 Problem Setup and Two Basic Forms of Algorithms

The gradient-based TD algorithms we consider belong to the class of model-free, temporal-difference-based learning algorithms for evaluating a stationary policy in a Markov decision process (MDP). We shall consider MDPs with finite spaces. For the purpose of this paper, however, we do not need the full MDP framework. (For references on MDPs, see the excellent textbook [22].)

It is adequate to consider two Markov chains on a finite state space $S$. (The states in these Markov chains need not correspond to the states of the MDP; they can correspond to state-action pairs. Depending on whether one evaluates the value function for states or state-action pairs in the MDP, the Markov chains here correspond to slightly different processes in the MDP; for the details of these correspondences, see [35, Examples 2.1, 2.2]. But the analysis is the same, so for notational simplicity, we have chosen not to introduce action variables in the paper.) The first Markov chain has transition matrix $P$; the second has a different transition matrix. The mechanism that induces the first chain will be denoted by $\pi$ and referred to as the target policy; the mechanism that induces the second chain will be referred to as the behavior policy. The second Markov chain we can observe; however, what we want is to evaluate the system performance with respect to (w.r.t.) the first Markov chain, which we do not observe—the “off-policy” learning case.

The performance of the target policy is defined w.r.t. a discounted total reward criterion as follows. A one-stage reward function $r_\pi$ specifies the expected reward $r_\pi(s)$ at each state $s$. Each state $s$ is also associated with a state-dependent discount factor $\gamma(s) \in [0,1]$. The expected discounted total reward for each initial state $s$ is defined by

$$v_\pi(s) := \mathbb{E}^\pi_s\Big[r_\pi(S_0) + \sum_{n=1}^\infty \gamma(S_1)\gamma(S_2)\cdots\gamma(S_n)\cdot r_\pi(S_n)\Big]. \tag{2.1}$$

Here the notation $\mathbb{E}^\pi_s$ indicates that the expectation is taken w.r.t. the Markov chain $\{S_n\}$ induced by $\pi$ and with the initial state $S_0 = s$. The function $v_\pi$ in (2.1) is called the value function of $\pi$, and it is well-defined under Condition 2.1(i) given below, which we shall assume throughout the paper.

Denote by $\Gamma$ the diagonal matrix with the discount factors $\gamma(s)$, $s \in S$, as its diagonal entries.

###### Condition 2.1 (Conditions on the target and behavior policies).
• The target policy $\pi$ is such that the inverse $(I - P\Gamma)^{-1}$ exists, and

• The behavior policy is such that every transition that has positive probability under the target policy also has positive probability under the behavior policy, and moreover, the Markov chain it induces is irreducible.

The second part of this condition is for the behavior policy that generates the observed Markov chain. It will be needed when we describe the off-policy learning algorithms.

By standard MDP theory (see e.g., [22]), under Condition 2.1(i), $v_\pi$ satisfies uniquely the linear equation (expressed in matrix/vector notation, with $v_\pi, r_\pi$ viewed as $|S|$-dimensional vectors):

$$v_\pi = r_\pi + P\Gamma v_\pi \qquad \big(\text{i.e., } v_\pi = (I - P\Gamma)^{-1} r_\pi\big). \tag{2.2}$$

It is known as the Bellman equation (or dynamic programming equation) for $\pi$. Besides this equation, $v_\pi$ also satisfies a broad family of generalized Bellman equations, which have $v_\pi$ as their unique solution and, like (2.2), express $v_\pi$ as the sum of two reward terms, with the first term representing the expected rewards received prior to a certain (randomized stopping) time and the second term those received afterwards. (We shall discuss these equations further in Section 2.1.3.)
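As a quick numerical illustration of (2.2), the value function can be obtained by solving the linear system directly. The transition matrix, rewards, and discount factors below are hypothetical, chosen only to make the example concrete:

```python
import numpy as np

# Hypothetical 3-state model (illustration only).
P = np.array([[0.0, 0.5, 0.5],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])    # target-policy transition matrix P
r_pi = np.array([1.0, 0.0, 2.0])  # one-stage rewards r_pi(s)
Gamma = np.diag([0.9, 0.8, 0.9])  # state-dependent discount factors gamma(s)

# Bellman equation (2.2): v_pi = r_pi + P Gamma v_pi, i.e. v_pi = (I - P Gamma)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(3) - P @ Gamma, r_pi)

# Sanity check: v_pi is the fixed point of the Bellman equation.
assert np.allclose(v_pi, r_pi + P @ Gamma @ v_pi)
```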

TD algorithms compute $v_\pi$ by solving such a Bellman equation for $\pi$. Which Bellman equation to solve is determined by certain parameters, which we call the λ-parameters, used by the algorithms in their iterative computation of eligibility traces (iterates that carry information about the past states). We shall give more details about this correspondence between λ and the generalized Bellman equations in Section 2.1.3, after we describe two basic forms of the GTD algorithms. For now, we will focus on the overall structure of the computation problem tackled by the gradient-based TD algorithms. Think of $v = T^{(\lambda)} v$ as one generalized Bellman equation that an algorithm chooses to solve. The operator $T^{(\lambda)}$ is an affine operator on $\mathbb{R}^{|S|}$ and, similar to (2.2), can be expressed as

$$T^{(\lambda)} v = r^{(\lambda)}_\pi + P^{(\lambda)} v, \qquad \forall\, v \in \mathbb{R}^{|S|}, \tag{2.3}$$

for a vector $r^{(\lambda)}_\pi$ and a substochastic matrix $P^{(\lambda)}$. We shall refer to $T^{(\lambda)}$ as a generalized Bellman operator for $\pi$. The TD algorithms we consider try to find an approximate solution to the linear equation

$$v = T^{(\lambda)} v,$$

by solving an optimization problem on a lower dimensional space using linear function approximation. Let us describe first the approximation architecture and then the formulation of the optimization problem.

Let $\phi: S \to \mathbb{R}^d$ be a given function that maps each state $s$ to a $d$-dimensional feature vector $\phi(s)$ (it will be taken for granted that $\phi$ is non-trivial; i.e., $\phi(s) \neq 0$ for at least one state $s$). Write $\phi = (\phi_1, \ldots, \phi_d)$, where each component $\phi_i$ is a function on $S$, and denote the subspace spanned by these component functions by $L_\phi$. To approximate $v_\pi$, the TD algorithms look for some function $v \in L_\phi$ that satisfies the generalized Bellman equation approximately: $v \approx T^{(\lambda)} v$. The functions in the approximation subspace $L_\phi$ are parameterized as $v_\theta = \Phi\theta$, for parameters $\theta \in \mathbb{R}^d$. (We treat $\theta$ and $\phi(s)$ as column vectors; the symbol $\top$ stands for transpose.) We do not require the functions $\phi_1, \ldots, \phi_d$ to be linearly independent; for this reason, another subspace will be useful later. This is the subspace in $\mathbb{R}^d$ spanned by the feature vectors $\{\phi(s) : s \in S\}$; below we shall write it as $\mathrm{span}\{\phi(S)\}$ for short. In matrix notation, $L_\phi$ is the column space of the matrix $\Phi$ that has the feature vectors $\phi(s)^\top$ as its rows; i.e.,

$$\Phi = \begin{bmatrix} \vdots \\ \phi(s)^\top \\ \vdots \end{bmatrix} \qquad \text{or} \qquad \Phi^\top = \begin{bmatrix} \cdots & \phi(s) & \cdots \end{bmatrix}.$$

Any approximate value function $v$ in the approximation subspace can be written as $v = \Phi\theta$ for some $\theta \in \mathbb{R}^d$—note that $\theta$ is uniquely determined by $v$ if $\theta$ is restricted to lie in $\mathrm{span}\{\phi(S)\}$.

Let us now describe the optimization problem that the gradient-based TD algorithms try to solve in order to find an approximation of $v_\pi$ in the approximation subspace.

#### 2.1.1 The objective function

To find a function $v$ in the approximation subspace with $v \approx T^{(\lambda)} v$, the two original gradient-based TD algorithms, GTDa and GTDb, to be described shortly, try to minimize an objective function of the form

$$J(\theta) = \tfrac{1}{2}\big\|\Pi_\xi\big(T^{(\lambda)} v_\theta - v_\theta\big)\big\|^2_\xi, \qquad \text{where } v_\theta = \Phi\theta,\ \theta \in \mathbb{R}^d. \tag{2.4}$$

Here $\Pi_\xi$ denotes projection onto the approximation subspace w.r.t. a weighted Euclidean norm $\|\cdot\|_\xi$, given by $\|v\|^2_\xi = \sum_{s \in S} \xi_s\, v(s)^2$ for a positive $|S|$-dimensional vector $\xi$ with components $\xi_s$. The objective $J$ measures the magnitude of the “Bellman error” $T^{(\lambda)} v_\theta - v_\theta$ on the approximation subspace. If the projected Bellman equation $v = \Pi_\xi T^{(\lambda)} v$ has a unique solution, the relation between this solution and $v_\pi$ can be characterized using the oblique projection viewpoint and related approximation error bounds (see the early works [26, 38] and a summary in more general terms given in the recent work [40, Appendix B]). In this paper, since our focus is on convergence properties of the algorithms and since the minimization problem always has an optimal solution, we do not require the projected Bellman equation to have a unique solution or any solution at all. (When it has no solution or multiple solutions, the quality of the approximate value function obtained from the minimization of $J$ could be a concern, though.)
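The objective (2.4) is straightforward to evaluate when the model quantities are available. The sketch below uses stand-in random matrices in the roles of $P^{(\lambda)}$ and $r^{(\lambda)}_\pi$ (all names and numbers are ours, for illustration only); the pseudo-inverse keeps the projection valid even when the columns of $\Phi$ are linearly dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, d = 5, 2
Phi = rng.normal(size=(nS, d))                     # feature matrix (rows phi(s)^T)
xi = np.full(nS, 1.0 / nS)                         # weights xi (a probability vector)
Xi = np.diag(xi)
P_lam = 0.9 * rng.dirichlet(np.ones(nS), size=nS)  # stand-in substochastic P^(lambda)
r_lam = rng.normal(size=nS)                        # stand-in r_pi^(lambda)

# Projection onto the column space of Phi w.r.t. the xi-weighted norm.
Proj = Phi @ np.linalg.pinv(Phi.T @ Xi @ Phi) @ Phi.T @ Xi

def J(theta):
    """Objective (2.4): half the squared xi-weighted norm of the projected Bellman error."""
    v = Phi @ theta
    pe = Proj @ (r_lam + P_lam @ v - v)   # Pi_xi (T^(lambda) v_theta - v_theta)
    return 0.5 * pe @ Xi @ pe

theta0 = rng.normal(size=d)
assert J(theta0) >= 0.0
assert np.allclose(Proj @ Proj, Proj)     # a projection is idempotent
```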

In this paper, we shall take $\xi$ to be the invariant probability distribution of the Markov chain induced by the behavior policy; such a distribution exists and is unique under Condition 2.1(ii). This choice of $\xi$ is mostly for notational simplicity: our analyses extend to cases where $\xi$ does not coincide with the invariant distribution of the behavior policy, but the algorithms in those cases have additional weighting terms and are notationally more cumbersome.

Later we will also discuss regularized objective functions of the form $J(\theta) + p(\theta)$, where $p$ is some smooth convex function that serves as a regularizer. It will be seen that little extra effort is needed in the convergence analysis to handle this additional term. So, for notational simplicity, we will take $J$ to be the objective function in the first half of the paper, and discuss the regularized objective function after we have presented the main convergence proof arguments. One can also consider a mixed objective function that combines the projected Bellman errors for multiple Bellman operators—for instance, the sum of two functions defined like $J$ but for two different λ's. The convergence analysis of the gradient-based TD algorithms for such mixed objectives is essentially the same as that for $J$, so we will focus on the latter for simplicity.

Let us now work out two expressions for $\nabla J(\theta)$, which are used respectively by the two GTD algorithms, before we describe these algorithms. Let $\langle \cdot, \cdot \rangle_\xi$ denote the $\xi$-weighted inner product on the Euclidean space $\mathbb{R}^{|S|}$; i.e., $\langle v, v' \rangle_\xi = \sum_{s \in S} \xi_s\, v(s)\, v'(s)$. (The notation $\langle \cdot, \cdot \rangle$ will be used for the usual inner product in Euclidean spaces.) For $i = 1, \ldots, d$, let $\Phi_i$ denote the $i$-th column of $\Phi$, and $\theta_i$ the $i$-th component of $\theta$. Since $v_\theta = \Phi\theta = \sum_{i=1}^d \theta_i \Phi_i$, the partial derivative of $J$ w.r.t. each $\theta_i$ is

$$\nabla_{\theta_i} J(\theta) = \big\langle \Pi_\xi\big(T^{(\lambda)} v_\theta - v_\theta\big),\, \Pi_\xi\big(P^{(\lambda)} - I\big)\Phi_i \big\rangle_\xi.$$

(Recall that $P^{(\lambda)}$ is the substochastic matrix in the affine operator $T^{(\lambda)}$; cf. (2.3).) Observe two facts. First, for any $v, v' \in \mathbb{R}^{|S|}$,

$$\langle \Pi_\xi v, \Pi_\xi v' \rangle_\xi = \langle v, \Pi_\xi v' \rangle_\xi = \langle \Pi_\xi v, v' \rangle_\xi.$$

Second, for any $v$ in the approximation subspace, there is a unique $x \in \mathrm{span}\{\phi(S)\}$ with $\Phi x = v$, and therefore, given $\theta$, there is a unique solution $x_\theta$ to the linear equation (in $x$),

$$\Phi x = \Pi_\xi\big(T^{(\lambda)} v_\theta - v_\theta\big), \qquad x \in \mathrm{span}\{\phi(S)\}, \tag{2.5}$$

which is also the unique solution to the equivalent linear equation (to see that (2.6) is equivalent to (2.5), note that for any $x$, $\Phi x = \Pi_\xi(T^{(\lambda)} v_\theta - v_\theta)$ if and only if, w.r.t. $\langle \cdot, \cdot \rangle_\xi$, the vector $\Phi x - (T^{(\lambda)} v_\theta - v_\theta)$ is perpendicular to the approximation subspace, which is true if and only if $\langle \Phi_i, \Phi x - (T^{(\lambda)} v_\theta - v_\theta) \rangle_\xi = 0$ for all $i$, since the approximation subspace is the column space of $\Phi$; the latter system of linear equations, written in matrix form, is the same as the first equation in (2.6)):

$$\Phi^\top \Xi \Phi x = \Phi^\top \Xi\big(T^{(\lambda)} v_\theta - v_\theta\big), \qquad x \in \mathrm{span}\{\phi(S)\}, \tag{2.6}$$

where $\Xi$ denotes the diagonal matrix with the components of $\xi$ on its diagonal.

Using the above facts, we can write

$$\nabla_{\theta_i} J(\theta) = \big\langle \Phi x_\theta,\, (P^{(\lambda)} - I)\Phi_i \big\rangle_\xi = x_\theta^\top \cdot \Phi^\top \Xi (P^{(\lambda)} - I)\Phi_i, \tag{2.7}$$

which gives the expression of the gradient as

$$\nabla J(\theta) = \big(\Phi^\top \Xi (P^{(\lambda)} - I)\Phi\big)^\top x_\theta. \tag{2.8}$$

Alternatively, we can write

$$\nabla_{\theta_i} J(\theta) = \big\langle \Phi x_\theta,\, (P^{(\lambda)} - I)\Phi_i \big\rangle_\xi = -\big\langle T^{(\lambda)} v_\theta - v_\theta,\, \Phi_i \big\rangle_\xi + \big\langle \Phi x_\theta,\, P^{(\lambda)} \Phi_i \big\rangle_\xi \tag{2.9}$$

(where in the second equality we used $\langle \Phi x_\theta, \Phi_i \rangle_\xi = \langle \Pi_\xi(T^{(\lambda)} v_\theta - v_\theta), \Phi_i \rangle_\xi = \langle T^{(\lambda)} v_\theta - v_\theta, \Phi_i \rangle_\xi$). This gives another expression of the gradient:

$$\nabla J(\theta) = -\Phi^\top \Xi\big(T^{(\lambda)} v_\theta - v_\theta\big) + \big(\Phi^\top \Xi P^{(\lambda)} \Phi\big)^\top x_\theta. \tag{2.10}$$

In principle one can derive other gradient expressions and formulate corresponding gradient-based algorithms; we shall, however, focus on the expressions (2.8) and (2.10) only.
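The two gradient expressions can be checked numerically on a small random instance (a sketch with stand-in quantities of our own: a random substochastic matrix plays the role of $P^{(\lambda)}$). Both expressions agree with each other and with a finite-difference gradient of $J$:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, d = 6, 3
Phi = rng.normal(size=(nS, d))                     # feature matrix
xi = rng.dirichlet(np.ones(nS)); Xi = np.diag(xi)
P_lam = 0.9 * rng.dirichlet(np.ones(nS), size=nS)  # stand-in substochastic P^(lambda)
r_lam = rng.normal(size=nS)                        # stand-in r_pi^(lambda)
I = np.eye(nS)

def x_of(theta):
    # Unique solution of (2.6) in span{phi(S)} (the pseudo-inverse returns that solution).
    err = r_lam + P_lam @ (Phi @ theta) - Phi @ theta
    return np.linalg.pinv(Phi.T @ Xi @ Phi) @ (Phi.T @ Xi @ err)

def J(theta):
    pe = Phi @ x_of(theta)                 # Pi_xi (T^(lambda) v_theta - v_theta), by (2.5)
    return 0.5 * pe @ Xi @ pe

theta = rng.normal(size=d)
err = r_lam + P_lam @ (Phi @ theta) - Phi @ theta
x_theta = x_of(theta)
grad_a = (Phi.T @ Xi @ (P_lam - I) @ Phi).T @ x_theta                # expression (2.8)
grad_b = -Phi.T @ Xi @ err + (Phi.T @ Xi @ P_lam @ Phi).T @ x_theta  # expression (2.10)
num = np.array([(J(theta + 1e-6 * np.eye(d)[i]) - J(theta - 1e-6 * np.eye(d)[i])) / 2e-6
                for i in range(d)])        # central-difference gradient
assert np.allclose(grad_a, grad_b)
assert np.allclose(grad_a, num, atol=1e-4)
```

The equality of the two expressions is exactly the content of equation (2.6): they differ by the term $\Phi^\top\Xi\Phi x_\theta - \Phi^\top\Xi(T^{(\lambda)}v_\theta - v_\theta)$, which vanishes at $x_\theta$.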

#### 2.1.2 GTDa and GTDb

We now describe the two basic forms of the GTD algorithms. As mentioned earlier, they can only observe the Markov chain $\{S_n\}$ induced by the behavior policy, instead of the target policy $\pi$. Upon each state transition $(S_n, S_{n+1})$, they receive a random reward $R_{n+1}$ that is a function of the transition, $r(S_n, S_{n+1})$, plus a zero-mean finite-variance noise term whose distribution is determined by the state transition. The reward function $r$ relates to the target policy's one-stage reward $r_\pi$ as $r_\pi(s) = \sum_{s' \in S} P(s, s')\, r(s, s')$ for all $s \in S$. The rewards and the states are all that the algorithms can observe.

Define $\rho(s, s')$ to be the ratio of the probabilities of the transition $(s, s')$ under the target and behavior policies, for all transitions possible under the behavior policy. These are the importance sampling ratios that can be used to compensate for the differences in the dynamics of the two Markov chains. We assume that the algorithms know these ratios (this is the case for standard value function or state-action value function estimation, as well as for the simulation context where both policies are known). To simplify notation, for $n \geq 0$, we write

$$\rho_n = \rho(S_n, S_{n+1}), \qquad \gamma_n = \gamma(S_n).$$

For any given approximate value function $v$ on $S$, we write $\delta_n(v)$ for the (scalar) temporal-difference term calculated based on the observed random transition $(S_n, S_{n+1})$ and reward $R_{n+1}$:

$$\delta_n(v) = \rho_n\big(R_{n+1} + \gamma_{n+1}\, v(S_{n+1}) - v(S_n)\big). \tag{2.11}$$

The conditional expectation of $\delta_n(v)$ given the history $S_0, \ldots, S_n$ measures the difference between the two sides of the Bellman equation (2.2) for the state $S_n$ when $v_\pi$ in (2.2) is replaced by $v$.

Using the states $\{S_n\}$, a sequence of eligibility trace vectors $e_n \in \mathbb{R}^d$ is calculated iteratively by both GTD algorithms according to this formula: given an initial $e_0$, for $n \geq 1$,

$$e_n = \lambda_n \gamma_n \rho_{n-1}\, e_{n-1} + \phi(S_n). \tag{2.12}$$

Here $\lambda_n \in [0, 1]$, $n \geq 1$, are the λ-parameters we referred to earlier. They are important parameters in TD learning. Not only do they affect the behavior of the algorithms, but they also determine the Bellman operator $T^{(\lambda)}$ appearing in the objective function $J$. In the next subsection we shall describe the choices of these parameters that we consider in this paper, and explain what the associated Bellman operators $T^{(\lambda)}$ are.

The eligibility traces (or traces, for short) are combined with temporal-difference terms by the algorithms to generate a sequence of iterates $(\theta_n, x_n)$, starting from some initial $(\theta_0, x_0)$. In particular, let $v_{\theta_n} = \Phi\theta_n$ as before. The first algorithm, GTDa, calculates the sequence iteratively according to

$$\theta_{n+1} = \theta_n + \alpha_n\, \rho_n\big(\phi(S_n) - \gamma_{n+1}\phi(S_{n+1})\big)\cdot e_n^\top x_n, \tag{2.13}$$
$$x_{n+1} = x_n + \beta_n\big(e_n\, \delta_n(v_{\theta_n}) - \phi(S_n)\phi(S_n)^\top x_n\big). \tag{2.14}$$

The second algorithm, GTDb, has the same formula for $x_{n+1}$, but calculates $\theta_{n+1}$ according to

$$\theta_{n+1} = \theta_n + \alpha_n\big(e_n\, \delta_n(v_{\theta_n}) - \rho_n(1 - \lambda_{n+1})\gamma_{n+1}\phi(S_{n+1})\cdot e_n^\top x_n\big). \tag{2.15}$$

In the above, $\alpha_n$ and $\beta_n$ are stepsizes, with $\alpha_n$ much smaller than $\beta_n$. (We will consider a broad range of stepsizes, and we defer the precise stepsize conditions to later sections where we analyze the algorithms.) Although it can be hard, for readers unfamiliar with TD algorithms, to see how the preceding formulae relate to the gradient $\nabla J$, the GTD algorithms do correspond to applying gradient-descent to minimize $J$ with the two gradient expressions (2.8) and (2.10), respectively. The idea of the two algorithms, roughly speaking, is to let the $\theta$-iterates evolve on a slow time-scale and the $x$-iterates on a fast time-scale. As $\theta_n$ varies slowly, the fast-evolving $x$-iterates aim to track the solution $x_{\theta_n}$ of (2.6) for the “current” $\theta$-iterate. The information about the gradient carried by this solution is then used to perform stochastic gradient-descent in the $\theta$-space to minimize $J$.
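To make the update formulae concrete, here is a minimal simulation sketch of GTDa, i.e., of (2.12)-(2.14), on a small hypothetical model with constant λ, constant discount factor, constant stepsizes, and no constraint sets (the paper analyzes primarily constrained versions); all numbers and variable names are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, d = 4, 2
P  = rng.dirichlet(np.ones(nS), size=nS)                   # target-policy transition matrix
Pb = 0.5 * rng.dirichlet(np.ones(nS), size=nS) + 0.5 / nS  # behavior chain: all entries positive
Phi = rng.normal(size=(nS, d))                             # feature matrix
r = rng.normal(size=(nS, nS))                              # expected rewards r(s, s')
gamma, lam = 0.9, 0.2                                      # constant discount and lambda
alpha, beta = 0.001, 0.01                                  # slow (theta) / fast (x) stepsizes

theta, x = np.zeros(d), np.zeros(d)
s = 0
e = Phi[s].copy()                                          # initialize trace e_0 = phi(S_0)
for n in range(20000):
    s_next = int(rng.choice(nS, p=Pb[s]))                  # transition under the behavior chain
    rho = P[s, s_next] / Pb[s, s_next]                     # importance-sampling ratio rho_n
    delta = rho * (r[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta)  # (2.11)
    theta = theta + alpha * rho * (Phi[s] - gamma * Phi[s_next]) * (e @ x)       # (2.13)
    x = x + beta * (e * delta - Phi[s] * (Phi[s] @ x))                           # (2.14)
    e = lam * gamma * rho * e + Phi[s_next]                # trace update (2.12)
    s = s_next
```

Here λ is kept small so the traces remain bounded; the history-dependent scheme of Section 2.1.3 is designed precisely to relax this restriction.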

To gain more intuition and insights about the algorithms, we suggest the reader consult the original derivations given in, e.g., [10, Chap. 7]. (GTDa and GTDb here are called GTD2 and TDC, respectively, in [10], for the case λ = 0. For the case of nonzero λ, GTDb here is called GTD(λ) for value function estimation and GQ(λ) for state-action value function estimation in [10]. More precisely, the GQ(λ) algorithm does not coincide exactly with GTDb for estimating state-action values; it differs from the latter in a term with a conditional mean of zero, which does not make any difference in convergence analysis, however.) We need to also point out, however, that these derivations have issues. For example, they involve various expectations of trace-dependent quantities that are taken for granted to be independent of the time $n$. But before we know the properties of the state-trace process $\{(S_n, e_n)\}$, it is not clear w.r.t. which probability distribution one can define such expectations so that they are independent of $n$. Indeed, even with constant $\lambda_n = \lambda$ for all $n$ and a stationary state process $\{S_n\}$, it does not immediately follow just from these that $\{e_n\}$ has to have a stationary distribution, let alone a unique one.

We shall discuss the properties of the state-trace process $\{(S_n, e_n)\}$ in Section 2.2. As can be seen from (2.12), this process depends on the λ-parameters in the algorithms. So let us first explain the relation between the λ-parameters and the generalized Bellman operators $T^{(\lambda)}$, since this will at least let us complete the definition of the objective function $J$. We will then focus the discussion on the state-trace process and its many properties that will be needed—actually, for some choices of the λ's that we consider, $T^{(\lambda)}$ itself also depends on the properties of this process. As to the connection between the above algorithms and the gradient $\nabla J$, it will be seen first in Prop. 2.1, Section 2.2.2, after we explain $T^{(\lambda)}$ and the ergodicity property of the state-trace process.

#### 2.1.3 Choices of λ-parameters and associated Bellman operators

We will consider in this paper three ways of setting the λ-parameters $\lambda_n$ in (2.12) for the trace iterates $e_n$. We discuss the first two in this subsection (the third one builds upon them and will be discussed in Section 3.4). These two choices are state-dependent λ and a case of history-dependent λ with special properties:

• State-dependent λ [28, 30], where $\lambda_n = \lambda(S_n)$ for a given function $\lambda: S \to [0, 1]$.

• History-dependent λ as introduced in [40], where we choose $\lambda_n$ based on the previous trace $e_{n-1}$ directly, in order to keep $\{e_n\}$ bounded. In particular, we introduce additional memory states $y_n$ to summarize the history of past states up to time $n$. We let $y_n$ evolve in a Markovian way and choose $\lambda_n$ based on the current $y_n$ and the previous trace $e_{n-1}$ as follows:

$$y_n = f(y_{n-1}, S_n), \qquad \lambda_n = \lambda(y_n, e_{n-1}), \tag{2.16}$$

where $f$ and $\lambda(\cdot)$ are some given functions, whose properties will be given shortly.

Although state-dependent λ is a special case of history-dependent λ (e.g., take $y_n = S_n$ in (ii)), generality is not our purpose here. The primary purpose of choosing λ based on (ii), as explained in [40], is to exploit the flexibility of history-dependent λ to bound the traces easily, while allowing for a large range of λ values. The latter is important because the choice of λ affects the generalized Bellman operator $T^{(\lambda)}$ appearing in the objective function $J$, and in turn, this choice of $T^{(\lambda)}$ affects the approximation error. Bounding the traces is also important, as it facilitates convergence of the algorithms. Thus, instead of the most general history-dependent λ, we shall focus on the special case (ii) under additional conditions studied in [40]. These conditions concern the memory states and the function $\lambda(\cdot)$, and they will be needed in the next subsection to ensure certain desired properties of the state-trace process:

###### Condition 2.2 (Evolution of memory states in (2.16)).

The memory states $y_n$ take values in a finite space $\mathcal{Y}$, and under the behavior policy, the Markov chain $\{(S_n, y_n)\}$ on $S \times \mathcal{Y}$ has a single recurrent class.

###### Condition 2.3 (Condition for λ(⋅) in (2.16)).

The function $\lambda(\cdot)$ in (2.16) satisfies the following. For some norm $\|\cdot\|$ on $\mathbb{R}^d$ and for each memory state $y$:

• For any $e, e' \in \mathbb{R}^d$, $\|\lambda(y, e)\, e - \lambda(y, e')\, e'\| \leq \|e - e'\|$.

• For some constant $C_y$, $\|\gamma(s')\, \rho(s, s')\, \lambda(y, e)\, e\| \leq C_y$ for all $e$ and all possible state transitions $(s, s')$ that can lead to the memory state $y$.

Several existing off-policy algorithms—Tree-Backup [21], Retrace [17] and ABQ [14]—choose state- or state-transition-dependent λ accordingly to keep the trace iterates bounded (in fact, these works motivated the history-dependent λ described above). Such choices of λ satisfy the above conditions, since one can simply let $y_n$ be a state or state transition and let $\lambda(\cdot)$ be a function of $y_n$ only. One disadvantage of these choices, however, is that they are too conservative and often result in small values of λ. A few examples of memory states and functions $\lambda(\cdot)$ that satisfy the above conditions are given in [40, Section 2.2].
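The trace-boundedness idea behind these conditions can be sketched as follows. The particular choice of $\lambda(\cdot)$ below is a hypothetical one of ours (not one of the paper's examples): λ is scaled down just enough that the decayed part of the trace update (2.12) stays below a bound C, which keeps the trace bounded no matter how large the importance sampling ratios get:

```python
import numpy as np

def lam_bounding(gamma_next, rho_prev, e_prev, C=5.0):
    # Hypothetical lambda(y, e): the largest lambda in [0, 1] for which the decayed
    # part of the trace update, lambda * gamma * rho * e_prev, has 1-norm at most C.
    decay = gamma_next * rho_prev * np.linalg.norm(e_prev, 1)
    return 1.0 if decay <= C else C / decay

rng = np.random.default_rng(4)
C, gamma = 5.0, 0.95
e = np.zeros(3)
for n in range(1000):
    rho = rng.exponential(2.0)              # importance-sampling ratio, possibly large
    phi = rng.normal(size=3)                # feature vector phi(S_n)
    lam = lam_bounding(gamma, rho, e, C)
    e = lam * gamma * rho * e + phi         # trace update (2.12)
    # the trace never exceeds C plus the norm of the newly added feature vector
    assert np.linalg.norm(e, 1) <= C + np.abs(phi).sum() + 1e-9
```

By contrast, a constant λ near 1 would let the products of the factors $\lambda\gamma\rho$ blow the trace up along unlucky trajectories.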

In the rest of this subsection, let us explain, at a level of detail adequate for the purpose of this paper, what the generalized Bellman operators $T^{(\lambda)}$ for the target policy associated with the preceding two ways of setting λ are. As mentioned above, corresponding to different choices of λ are different Bellman operators $T^{(\lambda)}$ in the objective function $J$. With different $T^{(\lambda)}$, the solutions of the minimization problem are also different and can have different approximation biases (see e.g., [40, Appendix B]). For the purpose of this paper, however, the details of $T^{(\lambda)}$ do not matter, because our focus is on the stochastic approximation aspects of the algorithms, and what we care about is whether the average dynamics of the algorithms can be characterized by mean ODEs that are related to the minimization of $J$ for the associated $T^{(\lambda)}$. The two cases of choosing λ mentioned above share many properties in common. Once their common properties are made clear, as will be done here and in the next subsection, the two cases can be treated together in most of our convergence analysis of the algorithms. For this reason, regarding the Bellman operators $T^{(\lambda)}$, we shall recount only the facts that we will need in this paper (for a detailed study, see the paper [40]).

As mentioned earlier, different choices of λ induce different Bellman operators for the target policy. They are members of a broad family of generalized Bellman operators associated with randomized stopping times [40, Section 3.1], which are all contractive operators that have the target value function v_π as their unique fixed point (see [40, Theorem 3.1 and Appendix A]). Such an operator takes the general form of

$(T_\tau v)(s) = \mathbb{E}_\pi\bigl[\, R_\tau + \gamma_1^\tau\, v(S_\tau) \,\big|\, S_0 = s \,\bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|},$  (2.17)

where $\mathbb{E}_\pi$ denotes expectation over the randomized stopping time τ and the states generated according to the target policy, R_τ is the total discounted reward received prior to the time τ of stopping, S_τ is the state at time τ, and γ_1^τ is a shorthand for the product of discount factors γ(S_1)γ(S_2)⋯γ(S_τ). These generalized Bellman equations and operators are a consequence of the strong Markov property of Markov chains [19, Theorem 3.3]. We refer the reader to [40, Section 3.1] for a fuller account of the framework and the mathematical notions and derivations involved.

For state-dependent λ, T^{(λ)} corresponds to a randomized stopping time τ for the Markov chain {S_n} under the target policy, where τ is such that

$\tau \ge 1, \qquad \mathbb{P}\bigl(\tau = n \,\big|\, \tau > n-1,\, S_0, \ldots, S_n\bigr) = 1 - \lambda(S_n) \quad \text{for } n \ge 1.$

(I.e., the probability of stopping at time n, given that the system has not stopped yet, is 1 − λ(S_n).) The associated operator T^{(λ)} can be expressed in several equivalent ways. Besides the general form (2.17) above, we can write T^{(λ)} as follows (this follows from (2.17) by taking conditional expectation over τ):

$(T^{(\lambda)} v)(s) = \mathbb{E}_\pi\Bigl[\, \sum_{n=0}^{\infty} \lambda_1^n \gamma_1^n\, r_\pi(S_n) + \sum_{n=1}^{\infty} \lambda_1^{n-1} (1 - \lambda_n)\, \gamma_1^n\, v(S_n) \,\Big|\, S_0 = s \Bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|},$  (2.18)

where we used the shorthand notation λ_1^n = λ_1 λ_2 ⋯ λ_n with λ_n = λ(S_n). We can also write T^{(λ)} explicitly in terms of λ(·) and the model parameters as follows (this follows from (2.18) by a direct calculation):

$T^{(\lambda)} v = (I - P\Gamma\Lambda)^{-1} r_\pi + (I - P\Gamma\Lambda)^{-1} P\Gamma (I - \Lambda)\, v,$  (2.19)

where Λ denotes the diagonal matrix with diagonal entries λ(s), s ∈ S. (Thus, the substochastic matrix P^{(λ)} in (2.3) has the explicit expression P^{(λ)} = (I − PΓΛ)^{-1} P\Gamma(I − Λ).)
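To make the matrix form concrete, here is a small numerical sketch with hypothetical, randomly generated model parameters (the names `P`, `Gamma`, `Lam`, `r_pi` are stand-ins for the transition matrix under the target policy, the state-dependent discount and λ matrices, and the expected rewards). It implements (2.19), checks that the value function solving v = r_π + PΓv is its fixed point, and verifies term-by-term consistency with a truncated version of the series form (2.18):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # a small hypothetical state space

# Stand-in model parameters.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # row-stochastic transition matrix
Gamma = np.diag(rng.uniform(0.5, 0.9, n))    # state-dependent discount factors
Lam = np.diag(rng.uniform(0.0, 1.0, n))      # state-dependent lambdas
r_pi = rng.random(n)                         # expected one-stage rewards
I = np.eye(n)

inv = np.linalg.inv(I - P @ Gamma @ Lam)

def T_lam(v):
    # Generalized Bellman operator in the matrix form (2.19).
    return inv @ (r_pi + P @ Gamma @ (I - Lam) @ v)

# v_pi solves v = r_pi + P Gamma v; it must be the unique fixed point of T_lam.
v_pi = np.linalg.solve(I - P @ Gamma, r_pi)
assert np.allclose(T_lam(v_pi), v_pi)

# Consistency with the series form (2.18): taking expectations term by term
# turns the two sums into powers of A = P Gamma Lam.
v = rng.random(n)
A = P @ Gamma @ Lam
series, A_n = np.zeros(n), I.copy()
for _ in range(300):
    series += A_n @ r_pi + A_n @ (P @ Gamma @ (I - Lam) @ v)
    A_n = A_n @ A
assert np.allclose(series, T_lam(v), atol=1e-8)
```

The second check is just the geometric-series identity behind (2.19): summing A^n over n ≥ 0 gives (I − PΓΛ)^{-1}, which converges here because the diagonal entries of ΓΛ are strictly below 1.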

In the case of history-dependent λ, it is shown in [40, Section 3.2] under Conditions 2.1-2.3 that T^{(λ)} is also a generalized Bellman operator (for the target policy) corresponding to a certain randomized stopping time τ. But this random time now depends on the behavior policy in a much more complex way than in the case of state-dependent λ. Among others, it depends on the dynamics of the traces under the behavior policy. As such, we generally cannot write T^{(λ)} explicitly in terms of the model parameters and the function λ. We express T^{(λ)} in other ways, in order to relate it to the algorithms that employ history-dependent λ. In particular, an expression of T^{(λ)} similar to (2.18) will be useful in our subsequent analysis:

$(T^{(\lambda)} v)(s) = \mathbb{E}_\pi^\zeta\Bigl[\, \sum_{n=0}^{\infty} \lambda_1^n \gamma_1^n\, r_\pi(S_n) + \sum_{n=1}^{\infty} \lambda_1^{n-1} (1 - \lambda_n)\, \gamma_1^n\, v(S_n) \,\Big|\, S_0 = s \Bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|}.$  (2.20)

In the above, $\mathbb{E}_\pi^\zeta$ denotes expectation with respect to the probability measure of the following process:

• The states S_0, S_1, … are generated under the target policy π.

• For n ≥ 1, the memory state y_n, the parameter λ_n, and the trace e_n are calculated according to (2.16) and (2.12), respectively. (The randomized stopping time τ is generated according to the following rule: τ ≥ 1 and, for n ≥ 1, the probability of stopping at time n given that the system has not stopped yet is 1 − λ_n, which is similar to the case of state-dependent λ. The random time τ does not appear in the expression (2.20) of T^{(λ)}, because (2.20) is an equivalent form of (2.17) after taking conditional expectation over τ.)

• The initial state, memory state and trace (S_0, y_0, e_0) are distributed according to ζ, the unique invariant probability measure of the state-trace process under the behavior policy.

The existence and uniqueness of the invariant probability measure ζ just mentioned is ensured under Condition 2.1 on the two policies and Conditions 2.2-2.3 on the memory states and the function λ. Further details will be explained in the next subsection (see [40, Section 3.2] for the derivations of the preceding results).
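To illustrate how a history-dependent λ can keep the traces bounded, here is a simplified sketch. It is not the paper's exact scheme from (2.16): it assumes the trace recursion (2.12) has the standard off-policy form e_t = φ(S_t) + λ_t γ_t ρ_{t−1} e_{t−1}, and it uses a hypothetical λ-rule that caps the norm of the decayed previous trace at a constant C (the memory state is, in effect, the scalar being capped):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, d, C = 6, 3, 10.0
Phi = rng.random((n_states, d))              # feature vectors phi(s), one row per state
gamma = rng.uniform(0.8, 1.0, n_states)      # state-dependent discount factors
rho = rng.uniform(0.0, 3.0, n_states)        # stand-in importance-sampling ratios

e = np.zeros(d)
max_norm = 0.0
s = 0
for _ in range(10000):
    s_next = rng.integers(n_states)          # hypothetical behavior-policy transition
    a = gamma[s_next] * rho[s] * np.linalg.norm(e)
    lam = 1.0 if a <= C else C / a           # cap the decayed trace's norm at C
    e = Phi[s_next] + lam * gamma[s_next] * rho[s] * e
    max_norm = max(max_norm, np.linalg.norm(e))
    s = s_next

# By construction ||e_t|| <= max_s ||phi(s)|| + C at every step.
assert max_norm <= np.max(np.linalg.norm(Phi, axis=1)) + C + 1e-9
```

With λ ≡ 1 the same simulation would produce trace norms that blow up whenever the products of γρ exceed 1 along a stretch of transitions; the capping rule trades a smaller effective λ on those stretches for a hard bound on the iterates.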

Another expression that will be useful later is an expression of the “Bellman error” T^{(λ)}v − v in terms of temporal-difference terms (this expression follows from (2.20) by rearranging terms):

$(T^{(\lambda)} v - v)(s) = \mathbb{E}_\pi^\zeta\Bigl[\, \sum_{n=0}^{\infty} \lambda_1^n \gamma_1^n \bigl( r_\pi(S_n) + \gamma_{n+1}\, v(S_{n+1}) - v(S_n) \bigr) \,\Big|\, S_0 = s \Bigr], \qquad s \in \mathcal{S},\ \forall\, v \in \mathbb{R}^{|\mathcal{S}|}.$  (2.21)

The same expression (ignoring the superscript ζ of $\mathbb{E}_\pi^\zeta$) also holds for the case of state-dependent λ.
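For state-dependent λ, (2.21) can be checked directly in matrix form: taking expectations term by term turns the right-hand side into (I − PΓΛ)^{-1}(r_π + PΓv − v), which coincides with T^{(λ)}v − v computed from (2.19). A small numerical sketch with hypothetical random model parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # transition matrix under the target policy
Gamma = np.diag(rng.uniform(0.5, 0.9, n))    # state-dependent discounts
Lam = np.diag(rng.uniform(0.0, 1.0, n))      # state-dependent lambdas
r_pi = rng.random(n)
I = np.eye(n)
v = rng.random(n)

inv = np.linalg.inv(I - P @ Gamma @ Lam)
T_v = inv @ (r_pi + P @ Gamma @ (I - Lam) @ v)   # T^(lambda) v via (2.19)
td_form = inv @ (r_pi + P @ Gamma @ v - v)       # matrix form of the TD-sum (2.21)
assert np.allclose(T_v - v, td_form)
```

The identity holds because (I − PΓΛ)v subtracted inside the resolvent leaves exactly the expected one-step TD error r_π + PΓv − v.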

### 2.2 Properties of State-Trace Process

The purpose of this subsection is to set the stage for analyzing the asymptotic behavior of the gradient-based TD algorithms using stochastic approximation theory. We collect here important properties of the state-trace process that our subsequent analysis will rely on. Specifically, in Section 2.2.2, we first discuss ergodicity properties of the state-trace process. We then consider several functions on the state-trace space that appear in the TD algorithms, and we derive their expectations w.r.t. the stationary state-trace process, which can be related to expressions of the gradient of the objective function. In Section 2.2.3, we include more properties of the trace iterates, which will be needed later in analyzing the average dynamics of the algorithms and proving their convergence.

Most of the results in this subsection were proved earlier by the author [35, 37, 40]. There are some small differences in the setup of the problem considered in those earlier works; these differences are nonessential and do not affect the conclusions obtained. For clarity, however, we will give additional details to bridge the gap, when this can be done quickly without repeating long proofs.

#### 2.2.1 Some notations and definitions

Let us first introduce some notation and definitions that we will need below and throughout the paper. In most of our analysis, we will treat the two cases of λ together. For brevity, let us collect the conditions given earlier for each case of λ in a single assumption.

###### Assumption 2.1.

Condition 2.1 holds. In the case of history-dependent λ given in (2.16), Conditions 2.2-2.3 also hold.

By the state-trace process, we mean {(S_n, e_n)} for the case of state-dependent λ, and {(S_n, y_n, e_n)} (including the memory states y_n) for the case of history-dependent λ, generated under the behavior policy. The state-trace process is a Markov chain with the weak Feller property—this means that, with x_n = (S_n, e_n) or x_n = (S_n, y_n, e_n) (depending on the case of λ), $\mathbb{E}[f(x_{n+1}) \mid x_n = x]$ is a continuous function of x for any bounded continuous function f [16, Prop. 6.1.1]. (Using the definitions of the traces and memory states, one can verify that this is the case.) Weak Feller Markov chains have nice ergodicity properties [15], which helped us in obtaining some of the ergodicity properties of the state-trace process that will be discussed shortly.

Let 1(·) denote the indicator function. For each initial condition x of (S_0, e_0) or (S_0, y_0, e_0), define random probability measures μ_{x,n}, n ≥ 0, on the state-trace space by

$\mu_{x,n}(D) = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{1}\bigl( (S_i, e_i) \in D \bigr) \quad \text{or} \quad \mu_{x,n}(D) = \frac{1}{n+1} \sum_{i=0}^{n} \mathbb{1}\bigl( (S_i, y_i, e_i) \in D \bigr)$

for all Borel subsets D of the state-trace space. (We take the topology on the state-trace space to be the product topology, with the discrete topology on the space of states/memory states and with the usual topology on the trace space ℝ^d. The Borel sigma-algebra on the state-trace space is generated by this topology.) We refer to them as the occupation probability measures of the state-trace process. Their convergence to the unique invariant probability measure ζ of the state-trace process is crucial for our convergence analysis of the TD algorithms. Here the sense of convergence for these probability measures is weak convergence, defined as follows: if P_n, n ≥ 0, and P are probability measures on the state-trace space and $\mathbb{E}_{P_n}[f] \to \mathbb{E}_P[f]$ as n → ∞ for all bounded continuous functions f, then the sequence {P_n} is said to converge weakly to P.
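To illustrate occupation probability measures in the simplest setting (states only, ignoring the trace component, on a hypothetical 3-state chain), the empirical state frequencies are the occupation measures evaluated on singleton sets, and they converge to the invariant distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
# A hypothetical irreducible 3-state transition matrix.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# Invariant distribution: normalized left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
xi = np.real(V[:, np.argmax(np.real(w))])
xi /= xi.sum()

# Occupation frequencies mu_{x,n}({s}) along one simulated trajectory.
N = 20000
s = 0
counts = np.zeros(3)
for _ in range(N):
    counts[s] += 1
    s = rng.choice(3, p=P[s])
freq = counts / N

assert np.max(np.abs(freq - xi)) < 0.05  # close to the invariant distribution
```

Since the state space here is finite and discrete, weak convergence of the occupation measures reduces to convergence of these frequencies; the trace component of the actual state-trace process is what makes weak (rather than, say, total-variation) convergence the appropriate notion.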

Let z_n = (S_n, e_n, S_{n+1}) in the case of state-dependent λ, and z_n = (S_n, y_n, e_n, S_{n+1}) in the case of history-dependent λ. Denote the space of z_n in each case by the same notation Z. (This is to prepare for handling temporal-difference terms, which involve state transitions.) Among (vector-valued) functions on Z, a set of them will be important in our analysis: these are functions that are Lipschitz continuous in the trace variable, uniformly w.r.t. the other components of z. We have z = (s, e, s′) or z = (s, y, e, s′) (depending on the case of λ), and since the state space and the memory state space are finite, such a function is just one that is Lipschitz continuous in e for each fixed value of the other components. So in what follows, when referring to such a function f, we will simply say f is Lipschitz continuous in the trace variable e.

Regarding other notation and terminology, the abbreviation “a.s.” stands for “almost surely,” and “P_x-a.s.” for “almost surely w.r.t. the probability measure P_x,” where the subscript x indicates that the process under consideration starts from the initial condition x. We shall use ‖·‖_∞ to denote the sup-norm and ‖·‖ the standard Euclidean norm. For a sequence of random variables {X_n}, we say X_n converges in mean to a random variable X if $\mathbb{E}[\|X_n - X\|] \to 0$ as n → ∞.

#### 2.2.2 Ergodicity properties

The ergodicity results given in Theorem 2.1 below are proved essentially in [35, Theorems 3.1, 3.3] for the case of state-dependent λ, and in [40, Theorem 3.2] for the case of history-dependent λ we consider. (The paper [35] analyzed the state-trace process in the case of constant λ. Its proof arguments and conclusions, however, extend to the case of state-dependent λ under Condition 2.1. In fact, this extension was incorporated in the convergence analysis of the more complex ETD algorithm [36], and that is why we do not repeat here the proof arguments for this extension. The paper [40] analyzed the state-trace process in the case of history-dependent λ under Assumption 2.1.) The first part of the theorem concerns the existence and uniqueness of an invariant probability measure of the state-trace process, and the convergence of the occupation probability measures. Without these ergodicity properties, the behavior of the TD algorithms would be quite different, indeed much more complex, and would have to be analyzed using more advanced stochastic approximation theory for differential inclusions (which is beyond the scope of the present paper).

The second part of the theorem will be used, among other things, to characterize the average dynamics of the algorithms. We need it in the case of state-dependent λ. In that case, the functions appearing in the algorithms can be unbounded, but they have the Lipschitz continuity property required in the theorem. Note that for bounded continuous functions, by the weak convergence of occupation probability measures given in Theorem 2.1(i), the conclusions in part (ii) automatically hold. (However, even in the case of history-dependent λ, we will still need the functions to be Lipschitz continuous in the trace variable, in order to show that they satisfy a certain “averaging condition” that is stronger than the convergence-in-mean ensured by Theorem 2.1 and is needed in the subsequent convergence analysis. See Prop. 2.3(i) and Remark 2.1 in Section 2.2.3.)

###### Theorem 2.1 (Ergodicity of the state-trace process).

Under Assumption 2.1, the following hold:

• The state-trace process is a weak Feller Markov chain and has a unique invariant probability measure ζ. For each initial condition x of the process, the occupation probability measures {μ_{x,n}} converge weakly to ζ, P_x-a.s.

• Let $\mathbb{E}_\zeta$ denote expectation w.r.t. the stationary state-trace process with initial distribution ζ. Then $\mathbb{E}_\zeta[\|f\|] < \infty$ for any vector-valued function f that is Lipschitz continuous in the trace variable. Furthermore, for such a function f, given each initial condition x of the process, as n → ∞, the empirical averages of f along the trajectory converge to $\mathbb{E}_\zeta[f]$, both in mean and almost surely.

Next, w.r.t. the stationary state-trace process with initial distribution ζ, we derive expressions of the expectation $\mathbb{E}_\zeta[\,\cdot\,]$ for several functions involved in the GTD algorithms. These expressions are related to the expressions (2.7)-(2.10) of the gradient and will appear in the mean ODEs associated with the algorithms. To state the results concisely, let

$\bar\delta(s, s', v) = \rho(s)\bigl( r(s, s') + \gamma(s')\, v(s') - v(s) \bigr), \qquad \bar\delta_0(v) = \bar\delta(S_0, S_1, v).$

(Recall that r(s,s′) is the mean reward associated with the transition (s,s′); cf. Section 2.1.2. The above are temporal-difference terms without noise in the rewards.) Recall also that P^{(λ)} is the substochastic matrix in the generalized Bellman operator T^{(λ)} (cf. (2.3)), φ_i is the i-th component of the function φ, and Φ_i the i-th column of the matrix Φ.

###### Proposition 2.1.

Under Assumption 2.1, we have

$\mathbb{E}_\zeta\bigl[ \phi(S_0)\phi(S_0)^\top \bigr] = \Phi^\top \Xi \Phi,$  (2.22)
$\mathbb{E}_\zeta\bigl[ e_0\, \bar\delta_0(v) \bigr] = \Phi^\top \Xi \bigl( T^{(\lambda)} v - v \bigr), \qquad \forall\, v \in \mathbb{R}^{|\mathcal{S}|},$  (2.23)
$\mathbb{E}_\zeta\bigl[ e_0 \cdot \rho_0 \bigl( \phi_i(S_0) - \gamma_1 \phi_i(S_1) \bigr) \bigr] = \Phi^\top \Xi \bigl( I - P^{(\lambda)} \bigr) \Phi_i, \qquad 1 \le i \le d,$  (2.24)
$\mathbb{E}_\zeta\bigl[ e_0 \cdot \rho_0 (1 - \lambda_1)\, \gamma_1\, \phi_i(S_1) \bigr] = \Phi^\top \Xi\, P^{(\lambda)} \Phi_i, \qquad 1 \le i \le d.$  (2.25)

To prove this proposition, it is convenient to extend the stationary state-trace process, whose time is indexed by n ≥ 0, to a double-ended stationary state-trace process {(S_n, e_n)} or {(S_n, y_n, e_n)} with −∞ < n < ∞. Let P_ζ denote the probability measure of the latter process. We shall keep using $\mathbb{E}_\zeta$ to denote expectation with respect to P_ζ. The following lemma gives an expression of the trace e_0 in this stationary process. It will facilitate our calculation of $\mathbb{E}_\zeta[\,\cdot\,]$ for the various functions in the proposition.

Regarding notation in the lemma and in what follows, for m ≤ n, let $\lambda_m^n = \prod_{i=m}^n \lambda_i$, $\gamma_m^n = \prod_{i=m}^n \gamma_i$, and $\rho_m^n = \prod_{i=m}^n \rho_i$, and in addition, adopt the convention that these products equal 1 if m > n. Let $\mathbb{1}$ denote the |S|-dimensional vector of all 1’s.

###### Lemma 2.1 (An expression for stationary traces).

Let Assumption 2.1 hold. Then, P_ζ-almost surely, the infinite series below is well-defined and finite, and

$e_0 = \phi(S_0) + \sum_{n=1}^{\infty} \lambda_{1-n}^{0}\, \gamma_{1-n}^{0}\, \rho_{-n}^{-1}\, \phi(S_{-n}).$  (2.26)
###### Proof.

For the case of history-dependent λ we consider, this is proved in [40, Lemma 3.1]. We give the proof for the case of state-dependent λ. The beginning part of the proof is the same as that of [40, Lemma 3.1] and similar to that of [35, Lemma 4.2] for the case of constant λ. Under Condition 2.1(i), $\sum_{n=1}^{\infty} (P\Gamma)^n < \infty$ (componentwise), and therefore,

$\mathbb{E}_\zeta\Bigl[ \sum_{n=1}^{\infty} \gamma_{1-n}^{0}\, \rho_{-n}^{-1} \Bigr] = \sum_{n=1}^{\infty} \mathbb{E}_\zeta\bigl[ \gamma_{1-n}^{0}\, \rho_{-n}^{-1} \bigr] = \sum_{n=1}^{\infty} \xi^\top (P\Gamma)^n \mathbb{1} < \infty,$

where the first equality follows from the monotone convergence theorem, and the second equality follows from a direct calculation together with the fact that the marginal of ζ on S coincides with ξ, the unique invariant probability measure of the Markov chain {S_n} under Condition 2.1(ii). Since $\lambda_{1-n}^0 \le 1$ for all n, the above relation implies that

$\mathbb{E}_\zeta\Bigl[ \sum_{n=1}^{\infty} \lambda_{1-n}^{0}\, \gamma_{1-n}^{0}\, \rho_{-n}^{-1}\, \|\phi(S_{-n})\| \Bigr] \le \max_{s \in \mathcal{S}} \|\phi(s)\| \cdot \mathbb{E}_\zeta\Bigl[ \sum_{n=1}^{\infty} \gamma_{1-n}^{0}\, \rho_{-n}^{-1} \Bigr] < \infty.$  (2.27)

It then follows from a theorem on integration [25, Theorem 1.38, pp. 28-29] that, P_ζ-almost surely, the infinite series in (2.26) converges to a finite limit.
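The unfolding behind (2.26) can be sanity-checked numerically. The sketch below assumes the trace recursion (2.12) has the form e_t = φ(S_t) + λ_t γ_t ρ_{t−1} e_{t−1} and compares m steps of the recursion, run forward from time −m, against the corresponding truncated sum in (2.26); the sequences are arbitrary hypothetical numbers standing in for a sampled trajectory:

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 50, 3
# Array index k = 0..m corresponds to time t = k - m (so k = m is time 0).
lam = rng.uniform(0.0, 1.0, m + 1)   # lambda_t
gam = rng.uniform(0.5, 1.0, m + 1)   # gamma_t
rho = rng.uniform(0.0, 2.0, m + 1)   # rho_t
phi = rng.random((m + 1, d))         # phi(S_t)

# Run the assumed recursion forward from e_{-m} = phi(S_{-m}).
e = phi[0].copy()
for k in range(1, m + 1):
    e = phi[k] + lam[k] * gam[k] * rho[k - 1] * e

# Truncated closed form of (2.26):
# e_0 = phi(S_0) + sum_{n=1}^{m} lambda_{1-n}^0 gamma_{1-n}^0 rho_{-n}^{-1} phi(S_{-n}).
e0 = phi[m].copy()
for n in range(1, m + 1):
    coef = np.prod(lam[m - n + 1:]) * np.prod(gam[m - n + 1:]) * np.prod(rho[m - n:m])
    e0 += coef * phi[m - n]

assert np.allclose(e, e0)
```

The agreement is exact (up to floating point) for any finite m; the lemma's content is that, in the stationary double-ended process, the remainder term carried by the recursion vanishes as m → ∞, which is what (2.27)-(2.28) establish.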

We now prove (2.26). Since, under Condition 2.1(i), $(P\Gamma)^m$ converges to the zero matrix as m → ∞, it follows from an argument very similar to the above that

$\mathbb{E}_\zeta\Bigl[ \bigl\| \sum_{n=m}^{\infty} \lambda_{1-n}^{0}\, \gamma_{1-n}^{0}\, \rho_{-n}^{-1}\, \phi(S_{-n}) \bigr\| \Bigr] \to 0, \quad \text{as } m \to \infty.$  (2.28)

Unfolding the iteration (2.12) for the traces backwards in time, we have that for all m ≥ 1,

 e0=ϕ(S0)+∑m−1n=1λ01−nγ01−nρ−1−nϕ(S−n)+λ01−mγ0