Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards

06/20/2023 ∙ by Semih Cayci, et al.

In a broad class of reinforcement learning applications, stochastic rewards have heavy-tailed distributions, which lead to infinite second-order moments for the stochastic (semi)gradients used in policy evaluation and direct policy optimization. In such settings, existing RL methods may fail badly due to frequent statistical outliers. In this work, we establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism, together with a natural actor-critic (NAC) algorithm built on it, can be provably robustified against heavy-tailed reward distributions. In the framework of linear function approximation, we show that this dynamic gradient clipping mechanism achieves a favorable tradeoff between the bias and the variability of the stochastic gradients. In particular, we prove that robust versions of TD learning achieve sample complexities of order O(ε^{-1/p}) and O(ε^{-1-1/p}) with and without the full-rank assumption on the feature matrix, respectively, under heavy-tailed rewards with finite moments of order (1+p) for some p ∈ (0,1], both in expectation and with high probability. We further show that a robust variant of NAC based on robust TD learning achieves a sample complexity of Õ(ε^{-4-2/p}). We corroborate our theoretical results with numerical experiments.
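The clipping mechanism described in the abstract can be illustrated, for TD(0) with linear function approximation, by the minimal sketch below. This is not the paper's exact algorithm: the paper uses a dynamic clipping threshold with specific schedules, whereas here the fixed `clip_threshold`, the step size, and the function name are all illustrative assumptions. The sketch only shows the core idea that clipping the semi-gradient bounds the impact of a single heavy-tailed reward outlier on the parameter update.

```python
import numpy as np

def robust_td_update(theta, phi_s, phi_s_next, reward, gamma=0.99,
                     step_size=0.01, clip_threshold=10.0):
    """One robust TD(0) step with linear function approximation.

    Illustrative sketch only: the paper's clipping threshold is dynamic;
    here a fixed `clip_threshold` is assumed for simplicity.
    """
    # TD error under the linear value estimate V(s) = phi(s)^T theta
    td_error = reward + gamma * phi_s_next @ theta - phi_s @ theta
    g = td_error * phi_s  # stochastic semi-gradient

    # Clip the semi-gradient: a heavy-tailed reward can make |td_error|
    # arbitrarily large, but the update magnitude stays bounded.
    norm = np.linalg.norm(g)
    if norm > clip_threshold:
        g = g * (clip_threshold / norm)

    return theta + step_size * g
```

Even for an extreme reward (e.g. `reward=1e6`), the parameter change per step is bounded by `step_size * clip_threshold`, which is the bias/variability tradeoff the abstract refers to: clipping introduces bias but keeps the update variance finite.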

Related research

∙ 02/25/2021: No-Regret Reinforcement Learning with Heavy-Tailed Rewards
∙ 02/20/2021: On Proximal Policy Optimization's Heavy-tailed Gradients
∙ 06/12/2023: Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
∙ 07/25/2023: High Probability Analysis for Non-Convex Stochastic Optimization with Clipping
∙ 06/27/2022: Efficient Private SCO for Heavy-Tailed Data via Clipping
∙ 06/02/2022: Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm
∙ 01/20/2022: Heavy-tailed Sampling via Transformed Unadjusted Langevin Algorithm