Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation

08/05/2022
by   Zaiwei Chen, et al.

In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various update rules, including natural policy gradient (NPG), are considered for the policy improvement step. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithmic framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the λ-averaged Q-trace and the two-sided Q-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high-variance issue in off-policy learning. As for the policy update, we provide a universal analysis using only the contraction property and the monotonicity property of the Bellman operator to establish geometric convergence under various policy update rules. Importantly, by viewing NPG as an approximate way of implementing policy iteration, we establish the geometric convergence of NPG without introducing regularization, and without using a mirror-descent-type analysis as in the existing literature. Combining the geometric convergence of the policy update with the finite-sample analysis of the policy evaluation, we establish for the first time an overall 𝒪(ϵ^-2) sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.
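To make the policy-evaluation idea concrete, the following is a minimal sketch of one multi-step off-policy TD update with two-sided clipped importance sampling ratios under linear function approximation Q(s, a) = φ(s, a)ᵀw. It is illustrative only: the feature map `phi`, the trajectory format, the tabular policies `pi`/`mu`, and the exact trace definition are assumptions for the sketch, not the paper's precise algorithm.

```python
import numpy as np

def qtrace_td_update(w, phi, traj, pi, mu,
                     gamma=0.99, alpha=0.1, n=5, l=0.5, u=2.0):
    """One n-step off-policy TD update with two-sided clipped importance
    sampling ratios, under linear function approximation Q(s,a) = phi(s,a) @ w.

    Simplified sketch: `phi(s, a)` returns a feature vector, `traj` is a
    list of (state, action, reward) tuples of length >= n + 1, and
    `pi` / `mu` are tabular target/behavior policies given as nested
    dicts mapping state -> action -> probability.
    """
    s0, a0, _ = traj[0]
    G, c = 0.0, 1.0          # n-step correction term and running trace product
    for k in range(n):
        s, a, r = traj[k]
        s_next, a_next, _ = traj[k + 1]
        if k > 0:
            # Two-sided clipping l <= rho <= u bounds the variance of the
            # product of ratios while limiting the bias that purely
            # one-sided (upper) clipping would introduce.
            c *= np.clip(pi[s][a] / mu[s][a], l, u)
        td_error = r + gamma * (phi(s_next, a_next) @ w) - phi(s, a) @ w
        G += (gamma ** k) * c * td_error
    # Semi-gradient update on the shared linear weight vector.
    return w + alpha * G * phi(s0, a0)
```

With `pi == mu` every ratio is 1, so the update reduces to ordinary on-policy n-step TD; the single weight vector `w` being updated at one time scale mirrors the single time-scale property highlighted in the abstract.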


