Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits

06/07/2022
by Tianyuan Jin, et al.

We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution belongs to a one-dimensional exponential family, which covers many common reward distributions including the Bernoulli, Gaussian, Gamma, and Exponential distributions. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS that simultaneously yields both a finite-time regret bound and the asymptotic regret bound. In particular, for a K-armed bandit with exponential family rewards over a horizon T, ExpTS is sub-UCB (a strong problem-dependent criterion for finite-time regret), minimax optimal up to a factor of √(log K), and asymptotically optimal. Moreover, we propose ExpTS^+, which adds a greedy exploitation step on top of the sampling distribution used in ExpTS in order to avoid over-estimation of sub-optimal arms. ExpTS^+ is an anytime bandit algorithm and achieves minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple, and they can easily be applied to analyze standard Thompson sampling with specific reward distributions.
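For orientation, the sketch below shows the standard Thompson sampling loop for Bernoulli rewards (one member of the one-dimensional exponential family). The per-arm Beta-posterior draw is exactly the step that ExpTS replaces with its novel sampling distribution, and the optional `greedy_prob` mixture only gestures at the greedy exploitation step the abstract attributes to ExpTS^+. The function name, the Beta(1, 1) prior, and the mixing rule here are illustrative assumptions, not the paper's constructions.

```python
import numpy as np

def thompson_sampling_bernoulli(means, T, greedy_prob=0.0, seed=None):
    """Minimal Thompson sampling loop for Bernoulli rewards with Beta(1, 1) priors.

    Illustrative only: ExpTS replaces the posterior draw below with a
    purpose-built sampling distribution, and `greedy_prob` crudely mimics
    the idea of mixing in a greedy (empirical-mean) step as in ExpTS^+;
    neither matches the paper's exact construction.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    wins = np.zeros(K)    # observed successes per arm
    losses = np.zeros(K)  # observed failures per arm
    cum_regret, best = 0.0, max(means)
    for _ in range(T):
        if rng.random() < greedy_prob:
            # Greedy exploitation step: trust the empirical means.
            pulls = wins + losses
            theta = wins / np.maximum(pulls, 1.0)
        else:
            # Posterior sampling: one draw per arm from Beta(wins+1, losses+1).
            theta = rng.beta(wins + 1.0, losses + 1.0)
        arm = int(np.argmax(theta))
        reward = float(rng.random() < means[arm])  # Bernoulli(means[arm]) reward
        wins[arm] += reward
        losses[arm] += 1.0 - reward
        cum_regret += best - means[arm]
    return cum_regret

if __name__ == "__main__":
    # Plain TS vs. a TS/greedy mixture on a 3-armed instance.
    print(thompson_sampling_bernoulli([0.2, 0.5, 0.6], T=20_000, seed=0))
    print(thompson_sampling_bernoulli([0.2, 0.5, 0.6], T=20_000,
                                      greedy_prob=0.5, seed=0))
```

Mixing greedy exploitation with posterior sampling reduces how often a sub-optimal arm is played on an inflated sample, which is the intuition the abstract gives for why ExpTS^+ avoids over-estimation of sub-optimal arms.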


