Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

06/12/2023
by   Jiayi Huang, et al.

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action space exist when the rewards are heavy-tailed, i.e., with only finite (1+ϵ)-th moments for some ϵ∈(0,1]. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits, achieving an instance-dependent T-round regret of Õ(d T^{(1-ϵ)/(2(1+ϵ))} √(∑_{t=1}^{T} ν_t^2) + d T^{(1-ϵ)/(2(1+ϵ))}), the first of this kind. Here, d is the feature dimension, and ν_t^{1+ϵ} is the (1+ϵ)-th central moment of the reward at the t-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed Heavy-LSVI-UCB, achieves the first computationally efficient instance-dependent K-episode regret of Õ(d √(H 𝒰^*) K^{1/(1+ϵ)} + d √(H 𝒱^* K)). Here, H is the length of the episode, and 𝒰^*, 𝒱^* are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound Ω(d H K^{1/(1+ϵ)} + d √(H^3 K)) to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
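
As a quick check of how the instance-dependent bandit bound specializes to a worst-case rate, the following is a short worked calculation rather than the paper's own derivation. It assumes a uniform bound ν_t ≤ ν for every round t; the constant ν is introduced here only for illustration and does not appear in the abstract.

```latex
% Assumption (not from the abstract): \nu_t \le \nu for all t = 1, \dots, T.
\begin{align*}
\sqrt{\sum_{t=1}^{T} \nu_t^{2}} &\le \nu \sqrt{T}, \\
\frac{1-\epsilon}{2(1+\epsilon)} + \frac{1}{2}
  &= \frac{(1-\epsilon) + (1+\epsilon)}{2(1+\epsilon)}
   = \frac{1}{1+\epsilon}, \\
d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^{T} \nu_t^{2}}
  &\le d\, \nu\, T^{\frac{1}{1+\epsilon}}.
\end{align*}
```

Under this assumption the first term scales as T^{1/(1+ϵ)}, the same exponent that appears on K in the RL bounds above, and for ϵ = 1 (finite variance) it becomes d ν √T. The second term, d T^{(1-ϵ)/(2(1+ϵ))}, has a strictly smaller exponent than 1/(1+ϵ); when ν_t = 0 for all t only this term remains, and at ϵ = 1 it reduces to d up to logarithmic factors.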


