Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret

01/22/2021
by Lai Wei, et al.

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distribution of rewards associated with each arm is assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by an oracle that selects the arm with the maximum mean reward at each time and the expected cumulative reward obtained by the policy. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We extend Upper-Confidence Bound (UCB)-based policies with three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.
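To make the sliding observation window idea concrete, the following is a minimal sketch of a UCB index computed only from the most recent plays. It is not the paper's policy or tuning: the function names, the window length `window`, the exploration constant `c`, and the two-arm switching environment are all illustrative assumptions.

```python
import math
import random
from collections import deque


def sliding_window_ucb(reward_fn, n_arms, horizon, window, c=2.0, seed=0):
    """Play `horizon` rounds; each arm's index uses only the last `window` plays.

    `reward_fn(arm, t, rng)` is a hypothetical environment callback.
    `window` and `c` are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    history = deque()          # (arm, reward) pairs currently inside the window
    counts = [0] * n_arms      # plays of each arm inside the window
    sums = [0.0] * n_arms      # reward totals of each arm inside the window
    choices = []
    for t in range(horizon):
        # Forget the oldest observation once the window is full.
        if len(history) == window:
            old_arm, old_reward = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        # Any arm with no observation left in the window is played first.
        untried = [k for k in range(n_arms) if counts[k] == 0]
        if untried:
            arm = untried[0]
        else:
            n = len(history)
            arm = max(
                range(n_arms),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(c * math.log(n) / counts[k]),
            )
        reward = reward_fn(arm, t, rng)
        history.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    return choices


def switching_rewards(arm, t, rng, change_point=500):
    """Hypothetical two-arm environment whose best arm flips at `change_point`."""
    best = 0 if t < change_point else 1
    mean = 0.9 if arm == best else 0.1
    return mean + rng.gauss(0.0, 0.1)
```

Calling `sliding_window_ucb(switching_rewards, n_arms=2, horizon=1000, window=100)` returns the sequence of chosen arms. Because observations older than the window are discarded, the index can recover after the best arm changes, whereas a standard UCB index would keep averaging over stale rewards. Periodic resetting and discounting are alternative ways of forgetting old data and are not shown here.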

