Stochastic Contextual Bandits with Long Horizon Rewards

02/02/2023
by Yuzhen Qin, et al.

The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most s prior actions and contexts (not necessarily consecutive), up to a time horizon of h. In order to avoid polynomial dependence on h, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor (T < h) and data-rich (T ≥ h) regimes, and derive respective regret upper bounds Õ(d√(sT) + min{q, T}) and Õ(√(sdT)), with sparsity s, feature dimension d, total time horizon T, and q that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon h. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
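To make the reward model concrete, the following is a minimal simulation sketch of the setting described above: the reward at time t aggregates linear contributions from at most s of the previous h (action, context) pairs, which need not be consecutive. All names (theta, lags, the uniform placeholder policy) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 5, 3   # feature dimension, number of arms
h, s = 20, 3  # reward horizon, sparsity of the dependence pattern
T = 50        # total time horizon (here T >= h, the data-rich regime)

# Unknown shared arm parameter vector of the linear reward model.
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

# Unknown sparse dependence pattern: s lags drawn from {0, ..., h-1},
# not necessarily consecutive (lag 0 is the current step).
lags = rng.choice(h, size=s, replace=False)

chosen_features = []  # feature vector of the arm pulled at each step
rewards = []

for t in range(T):
    contexts = rng.normal(size=(K, d))  # fresh context per arm
    a = rng.integers(K)                 # placeholder policy: uniform random
    chosen_features.append(contexts[a])

    # Reward depends only on the sparse set of past (action, context) pairs.
    r = sum(
        chosen_features[t - j] @ theta
        for j in lags
        if t - j >= 0
    ) + 0.01 * rng.normal()
    rewards.append(r)

print(len(rewards))  # one reward observation per step of the single trajectory
```

A learner must recover both `theta` and `lags` jointly from this single trajectory, which is where the rank-1 structure and the circulant-matrix isometry analysis in the paper come in.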


research
09/28/2020

Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Episodic reinforcement learning and contextual bandits are two widely st...
research
10/15/2021

Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

We study the optimal batch-regret tradeoff for batch linear contextual b...
research
02/09/2022

Smoothed Online Learning is as Easy as Statistical Learning

Much of modern learning theory has been split between two regimes: the c...
research
02/23/2017

Rotting Bandits

The Multi-Armed Bandits (MAB) framework highlights the tension between a...
research
05/01/2020

Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

Learning to plan for long horizons is a central challenge in episodic re...
research
03/04/2020

Taking a hint: How to leverage loss predictors in contextual bandits?

We initiate the study of learning in contextual bandits with the help of...
