Maillard Sampling: Boltzmann Exploration Done Optimally

11/05/2021
by   Jie Bian, et al.
10

The PhD thesis of Maillard (2013) presents a randomized algorithm for the K-armed bandit problem. This less-known algorithm, which we call Maillard sampling (MS), computes the probability of choosing each arm in a closed form, which is useful for counterfactual evaluation from bandit-logged data but was lacking from Thompson sampling, a widely-adopted bandit algorithm in the industry. Motivated by such merit, we revisit MS and perform an improved analysis to show that it achieves both the asymptotical optimality and √(KTlogT) minimax regret bound where T is the time horizon, which matches the standard asymptotically optimal UCB's performance. We then propose a variant of MS called MS^+ that improves its minimax bound to √(KTlogK) without losing the asymptotic optimality. MS^+ can also be tuned to be aggressive (i.e., less exploration) without losing theoretical guarantees, a unique feature unavailable from existing bandit algorithms. Our numerical evaluation shows the effectiveness of MS^+.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/03/2020

MOTS: Minimax Optimal Thompson Sampling

Thompson sampling is one of the most widely used algorithms for many onl...
research
04/28/2023

Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

We study K-armed bandit problems where the reward distributions of the a...
research
07/29/2020

An Index-based Deterministic Asymptotically Optimal Algorithm for Constrained Multi-armed Bandit Problems

For the model of constrained multi-armed bandit, we show that by constru...
research
02/21/2020

Double Explore-then-Commit: Asymptotic Optimality and Beyond

We study the two-armed bandit problem with subGaussian rewards. The expl...
research
08/14/2019

Thompson Sampling and Approximate Inference

We study the effects of approximate inference on the performance of Thom...
research
06/04/2018

Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling

Learning the minimum/maximum mean among a finite set of distributions is...
research
04/06/2021

On the Optimality of Batch Policy Optimization Algorithms

Batch policy optimization considers leveraging existing data for policy ...

Please sign up or login with your details

Forgot password? Click here to reset