The Typical Behavior of Bandit Algorithms

10/11/2022
by Lin Fan et al.

We establish strong laws of large numbers (SLLNs) and central limit theorems (CLTs) for the regret of two of the most popular bandit algorithms: Thompson sampling and UCB. Our characterizations of the regret distribution complement the characterizations of its tail recently developed by Fan and Glynn (2021, arXiv:2109.13595). The tail characterizations there are associated with atypical bandit behavior on trajectories where the optimal arm mean is underestimated, leading to misidentification of the optimal arm and large regret. In contrast, our SLLNs and CLTs describe the typical behavior and fluctuation of regret on trajectories where the optimal arm mean is properly estimated. We find that Thompson sampling and UCB satisfy the same SLLN and CLT, with the asymptotics of both the SLLN and the (mean) centering sequence in the CLT matching the asymptotics of expected regret. Both the mean and the variance in the CLT grow at log(T) rates as the time horizon T grows. Asymptotically as T → ∞, the variability in the number of plays of each sub-optimal arm depends only on the rewards received for that arm, which indicates that each sub-optimal arm contributes independently to the overall CLT variance.
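To make the log(T) scaling concrete, here is a minimal simulation sketch (not taken from the paper) comparing the two algorithms the abstract discusses. The Bernoulli arms with means 0.5 and 0.6, the Beta(1,1) priors for Thompson sampling, the UCB1 index, the horizon, and the number of replications are all illustrative assumptions.

```python
import numpy as np

def thompson_sampling(means, T, rng):
    """Beta-Bernoulli Thompson sampling; returns cumulative (pseudo-)regret."""
    K = len(means)
    successes = np.zeros(K)
    failures = np.zeros(K)
    best = max(means)
    regret = 0.0
    for _ in range(T):
        # Sample a mean estimate for each arm from its Beta(1+S, 1+F) posterior.
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))
        reward = rng.random() < means[arm]  # Bernoulli reward
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best - means[arm]
    return regret

def ucb1(means, T, rng):
    """UCB1 with exploration bonus sqrt(2 log t / n); returns cumulative regret."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # play each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.random() < means[arm]
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

# Regret distribution over independent runs: for both algorithms, the
# empirical mean and standard deviation of regret should scale like
# c * log(T) and O(sqrt(log T)), consistent with the SLLN/CLT above.
rng = np.random.default_rng(0)
means = [0.5, 0.6]
T = 10_000
for algo in (thompson_sampling, ucb1):
    R = np.array([algo(means, T, rng) for _ in range(100)])
    print(f"{algo.__name__}: mean regret = {R.mean():.1f}, std = {R.std():.1f}")
```

Repeating this experiment at several horizons T and plotting the empirical mean and standard deviation of regret against log(T) should show both growing roughly linearly in log(T), in line with the scaling the abstract describes.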


