Revisiting Simple Regret Minimization in Multi-Armed Bandits

10/30/2022
by Yao Zhao, et al.

Simple regret is a natural and parameter-free performance criterion for identifying a good arm in multi-armed bandits, yet it is less popular than the probability of missing the best arm or an ϵ-good arm, perhaps for lack of easy ways to characterize it. In this paper, we achieve improved simple regret upper bounds for both the data-rich (T ≥ n) and data-poor (T ≤ n) regimes, where n is the number of arms and T is the number of samples. At its heart is an improved analysis of the well-known Sequential Halving (SH) algorithm that bounds the probability of returning an arm whose mean reward is not within ϵ of the best (i.e., not ϵ-good) for any choice of ϵ > 0, even though ϵ is not an input to SH. We show that this directly implies an optimal simple regret bound of 𝒪(√(n/T)). Furthermore, our upper bound shrinks as the number of ϵ-good arms grows. This yields an accelerated rate for the (ϵ,δ)-PAC criterion, which closes the gap between the upper and lower bounds in prior art. For the more challenging data-poor regime, we propose Bracketing SH (BSH), which enjoys the same improvement even without sampling each arm at least once. Our empirical study shows that BSH outperforms existing methods on real-world tasks.
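For context, simple regret under a fixed budget T is standardly defined as the gap between the best mean reward and the mean reward of the arm the algorithm recommends:

\[
r_T \;=\; \mu^{*} - \mathbb{E}\bigl[\mu_{\hat{I}_T}\bigr],
\qquad \mu^{*} = \max_{1 \le i \le n} \mu_i,
\]

where \(\hat{I}_T\) is the arm returned after the T samples are spent. It is parameter-free in the sense that, unlike the (ϵ,δ)-PAC criterion, it requires no choice of ϵ or δ in advance.

Below is a minimal sketch of the baseline Sequential Halving procedure the abstract refers to (Karnin et al., 2013). It reproduces only the standard algorithm, not the paper's improved analysis or the Bracketing SH variant; the arms-as-callables interface and the exact per-round budget split are implementation assumptions for illustration.

```python
import math
import random

def sequential_halving(arms, budget):
    """Baseline Sequential Halving: split the budget over ceil(log2 n)
    rounds, pull every surviving arm equally often in each round, and
    eliminate the empirically worse half until one arm remains.

    `arms` is a list of zero-argument callables, each returning one
    stochastic reward draw for that arm (an interface assumed here).
    """
    survivors = list(range(len(arms)))
    rounds = max(1, math.ceil(math.log2(len(arms))))
    for _ in range(rounds):
        # Equal share of the budget for this round's surviving arms.
        pulls = max(1, budget // (len(survivors) * rounds))
        # Fresh empirical means: SH discards samples from earlier rounds.
        means = {i: sum(arms[i]() for _ in range(pulls)) / pulls
                 for i in survivors}
        # Keep the empirically better half (rounded up, never below one).
        survivors.sort(key=means.get, reverse=True)
        survivors = survivors[: max(1, math.ceil(len(survivors) / 2))]
        if len(survivors) == 1:
            break
    return survivors[0]

# Toy usage with Bernoulli arms; arm 0 has the highest mean.
true_means = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
arms = [lambda p=p: float(random.random() < p) for p in true_means]
print(sequential_halving(arms, budget=4000))
```

Note that SH takes no ϵ as input; the paper's contribution is to show that its probability of returning a non-ϵ-good arm can nevertheless be bounded simultaneously for every ϵ > 0, which is what yields the 𝒪(√(n/T)) simple regret bound above.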
