Robustness of Anytime Bandit Policies

07/22/2011
by Antoine Salomon, et al.

This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known to the agent beforehand, Audibert et al. (2009) exhibited a policy whose regret is of order log(n) with probability at least 1-1/n. They also showed that this property is not shared by the popular UCB1 policy of Auer et al. (2002). This work first answers an open question by extending this negative result to any anytime policy, i.e., any policy that does not require knowledge of the horizon n. The second contribution is the design of anytime robust policies for specific multi-armed bandit problems in which the set of possible distributions of the arms is restricted.
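For context, the UCB1 policy of Auer et al. (2002) pulls, at each round t, the arm maximizing its empirical mean plus the exploration bonus sqrt(2 ln t / n_i), where n_i is the number of times arm i has been pulled so far. The following is a minimal illustrative sketch in Python on Bernoulli arms; the arm means, horizon, and seed are hypothetical choices for the example, not taken from the paper.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run the UCB1 index policy on Bernoulli arms.

    Index of arm i at round t: empirical mean + sqrt(2 ln t / n_i).
    Returns the realized pseudo-regret against the best arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k        # n_i: number of pulls of arm i
    sums = [0.0] * k        # cumulative reward of arm i
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # pull each arm once to initialize
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]  # pseudo-regret increment

    return regret

# Illustrative run: two Bernoulli arms with means 0.6 and 0.5.
print(ucb1([0.6, 0.5], horizon=10_000))
```

Note that UCB1 never uses the horizon, so it is an anytime policy in the sense above; the class of such policies is exactly the one covered by the paper's negative result on regret deviations.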


