Adapting to Delays and Data in Adversarial Multi-Armed Bandits

10/12/2020
by András György, et al.

We consider the adversarial multi-armed bandit problem under delayed feedback. We analyze variants of the Exp3 algorithm that tune their step size using only information (about the losses and delays) available at the time of the decisions, and obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses. First, through a remarkably simple proof technique, we show that with proper tuning of the step size, the algorithm achieves an optimal (up to logarithmic factors) regret of order √(log(K)(TK + D)) both in expectation and with high probability, where K is the number of arms, T is the time horizon, and D is the cumulative delay. The high-probability version of the bound, which is the first high-probability delay-adaptive bound in the literature, crucially depends on the use of implicit exploration in estimating the losses. Then, following Zimmert and Seldin [2019], we extend these results so that the algorithm can "skip" rounds with large delays, resulting in regret bounds of order √(TK log(K)) + |R| + √(D_R̅ log(K)), where R is an arbitrary set of rounds (which are skipped) and D_R̅ is the cumulative delay of the feedback for the remaining rounds. Finally, we present another, data-adaptive (AdaGrad-style) version of the algorithm, for which the regret adapts to the observed (delayed) losses instead of only to the cumulative delay; this algorithm requires an a priori upper bound on the maximum delay, or advance knowledge of the delay of each decision's feedback at the time the decision is made. The resulting bound can be orders of magnitude smaller on benign problems, and it can be shown that the delay affects the regret only through the loss of the best arm.
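The abstract compresses several moving parts: a step size tuned from observed delays, implicit exploration in the loss estimates, and delayed importance-weighted updates. The following is a minimal Python sketch of this flavor of algorithm, not the authors' exact method: the schedule η_t = √(log(K)/(Kt + D_t)) is a stand-in inspired by the √(log(K)(TK + D)) bound, γ = η/2 is a common implicit-exploration (IX) choice rather than the paper's tuning, and get_loss/get_delay are hypothetical callables standing in for the environment.

```python
import numpy as np

def delayed_exp3_ix(K, T, get_loss, get_delay, seed=0):
    """Sketch of delay-adaptive Exp3 with implicit exploration (IX).

    Hedged illustration only: step-size schedule, IX parameter, and
    feedback bookkeeping are plausible choices, not the paper's exact ones.
    """
    rng = np.random.default_rng(seed)
    loss_est = np.zeros(K)   # cumulative importance-weighted loss estimates
    pending = []             # feedback in flight: (arrival_round, arm, loss, prob)
    D = 0                    # cumulative delay observed so far
    for t in range(1, T + 1):
        # Step size uses only information available at decision time
        # (stand-in schedule inspired by the sqrt(log(K)(TK + D)) bound).
        eta = np.sqrt(np.log(K) / (K * t + D + 1.0))
        gamma = eta / 2.0    # implicit-exploration parameter (common IX choice)
        shifted = loss_est - loss_est.min()   # stabilize the exponent
        w = np.exp(-eta * shifted)
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        d = get_delay(t)     # delay of this round's feedback (assumed callable)
        D += d
        pending.append((t + d, arm, get_loss(t, arm), p[arm]))
        # Process feedback arriving by the end of round t. The IX estimator
        # divides by (play-time probability + gamma); this biased-but-stable
        # estimate is what enables high-probability guarantees.
        arrived = [f for f in pending if f[0] <= t]
        pending = [f for f in pending if f[0] > t]
        for _, a, loss, prob in arrived:
            loss_est[a] += loss / (prob + gamma)
    return loss_est
```

For example, calling delayed_exp3_ix(K=5, T=1000, get_loss=lambda t, a: rng.uniform(), get_delay=lambda t: rng.integers(0, 10)) with a NumPy generator rng runs the sketch on random losses and delays. One design point worth noting: this sketch divides by the probability recorded when the arm was played; other variants use the probability at the time the feedback arrives, and the paper's analysis is what determines which bookkeeping yields the stated bounds.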


Related research

An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays (10/14/2019)
We propose a new algorithm for adversarial multi-armed bandits with unre...

Delayed Bandits: When Do Intermediate Observations Help? (05/30/2023)
We study a K-armed bandit with delayed feedback and intermediate observa...

Explore no more: Improved high-probability regret bounds for non-stochastic bandits (06/10/2015)
This work addresses the problem of regret minimization in non-stochastic...

Delaytron: Efficient Learning of Multiclass Classifiers with Delayed Bandit Feedbacks (05/17/2022)
In this paper, we present an online algorithm called Delaytron for learning...

Nonstochastic Bandits and Experts with Arm-Dependent Delays (11/02/2021)
We study nonstochastic bandits and experts in a delayed setting where de...

First- and Second-Order Bounds for Adversarial Linear Contextual Bandits (05/01/2023)
We consider the adversarial linear contextual bandit setting, which allo...

Nonstochastic Multiarmed Bandits with Unrestricted Delays (06/03/2019)
We investigate multiarmed bandits with delayed feedback, where the delay...
