Expected Sarsa(λ) with Control Variate for Variance Reduction

06/25/2019
by Long Yang, et al.

Off-policy learning is powerful for reinforcement learning, but the high variance of off-policy evaluation is a critical challenge that can drive off-policy learning with function approximation into uncontrolled instability. In this paper, to reduce this variance, we introduce the control variate technique into Expected Sarsa(λ) and propose a tabular ES(λ)-CV algorithm. We prove that, provided a sufficiently accurate estimate of the value function is available, ES(λ)-CV enjoys a lower variance than Expected Sarsa(λ). Furthermore, to extend ES(λ)-CV to a convergent algorithm with linear function approximation, we propose the GES(λ) algorithm under a convex-concave saddle-point formulation. We prove that GES(λ) achieves a convergence rate of O(1/T), matching or outperforming several state-of-the-art gradient-based algorithms while requiring a more relaxed step-size. Numerical experiments show that the proposed algorithm is stable and converges faster, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: GQ(λ), GTB(λ), and ABQ(ζ).
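The variance-reduction mechanism the abstract describes rests on a standard property of importance sampling: the ratio ρ = π(a|s)/μ(a|s) has expectation 1 under the behavior policy μ, so subtracting (ρ − 1) times a baseline leaves an estimator unbiased while cancelling variance when the baseline approximates the value function. Below is a minimal sketch of that principle on a hypothetical two-action bandit. It illustrates the control-variate idea behind ES(λ)-CV but is not the paper's algorithm; all names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-action bandit, used only to illustrate the
# control-variate principle (not the paper's ES(lambda)-CV updates).
q = np.array([1.0, 3.0])          # true action values
pi = np.array([0.9, 0.1])         # target policy
mu = np.array([0.5, 0.5])         # behavior policy

n = 100_000
actions = rng.choice(2, size=n, p=mu)
rewards = q[actions] + rng.normal(0.0, 0.1, size=n)
rho = pi[actions] / mu[actions]   # importance ratios, E_mu[rho] = 1

v_hat = pi @ q                    # baseline: an estimate of E_pi[reward]

plain = rho * rewards                         # ordinary IS estimator
cv    = rho * rewards - (rho - 1.0) * v_hat   # control-variate estimator

# Both are unbiased for E_pi[reward] because E_mu[rho - 1] = 0, but
# subtracting the correlated term (rho - 1) * v_hat reduces variance.
print(f"target   : {pi @ q:.3f}")
print(f"plain IS : mean {plain.mean():.3f}, var {plain.var():.3f}")
print(f"with CV  : mean {cv.mean():.3f}, var {cv.var():.3f}")
```

On this toy problem both estimators agree in mean, while the control-variate version shows a markedly lower sample variance, mirroring the abstract's condition that the variance reduction hinges on having a good value-function estimate to serve as the baseline.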


