Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

04/28/2023
by Hao Qin, et al.

We study K-armed bandit problems where the reward distributions of the arms are all supported on the [0,1] interval. Designing regret-efficient randomized exploration algorithms in this setting has been a challenge. Maillard sampling (Maillard, 2013), an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting (Bian and Jun, 2022) while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon length.
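To make the closed-form action probabilities concrete, here is a minimal Python sketch, assuming the KL-MS rule replaces the squared-gap exponent of Maillard sampling with the binary KL divergence: arm a is played with probability proportional to exp(-N_a · kl(μ̂_a, μ̂_max)), where N_a is the pull count, μ̂_a the empirical mean, and μ̂_max the largest empirical mean. The helper names, arm means, and simulation loop below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q) for p, q in [0, 1]."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ms_probs(mu_hat, counts):
    """Closed-form KL-MS action probabilities:
    p(a) proportional to exp(-N_a * kl(mu_hat_a, max_a' mu_hat_a'))."""
    mu_max = np.max(mu_hat)
    logits = -counts * bernoulli_kl(mu_hat, mu_max)
    w = np.exp(logits - logits.max())  # stabilize before normalizing
    return w / w.sum()

# Hypothetical Bernoulli bandit simulation.
rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.5, 0.7])  # assumed arm means
K, T = len(true_means), 10_000
counts = np.ones(K)  # pull each arm once to initialize
sums = rng.binomial(1, true_means).astype(float)
for t in range(K, T):
    probs = kl_ms_probs(sums / counts, counts)
    a = rng.choice(K, p=probs)  # randomized exploration step
    sums[a] += rng.binomial(1, true_means[a])
    counts[a] += 1
print("empirical means:", np.round(sums / counts, 3))
```

Because `probs` is computed in closed form before each pull, the sampling probability of the chosen arm can be logged as a propensity score, which is what makes this family of algorithms convenient for offline policy evaluation.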
